Python: Pivot Table/group by specific conditions

Python: Pivot Table/group by specific conditions - python

I'm trying to change structure of my data from text file(.txt) which data look like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
And I would like to transform them into this format (like pivot-table in excel which column name is character between ":" and each group always start with :1:)
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Does anyone have any idea? Thanks in advance.

First create DataFrame by read_csv with header=None, because no header in file:
import pandas as pd
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), header=None)
print (df)
0
0 :1:A
1 :2:B
2 :3:C
3 :1:D
4 :2:E
5 :3:F
6 :4:G
7 :1:H
8 :3:I
9 :4:J
Extract original column by DataFrame.pop, then remove traling : by Series.str.strip and Series.str.split values to 2 new columns. Then create groups by compare with Series.eq for == by string 0 with Series.cumsum, create MultiIndex by DataFrame.set_index and last reshape by Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a 1 2 3 4
a
1 A B C
2 D E F G
3 H I J

Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index.name = 'Group'
res.index = range(1, res.shape[0] + 1)
res
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J

Another way to do this:
#read the file
with open("t.txt") as f:
content = f.readlines()
#Create a dictionary and read each line from file to keep the column names (ex, :1:) as keys and rows(ex, A) as values in dictionary.
my_dict={}
for v in content:
key = v.rstrip(':')[0:3] # take the value ':1:'
value = v.rstrip(':')[3] # take value 'A'
my_dict.setdefault(key,[]).append(value)
#convert dictionary to dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict,orient='index').transpose()
df
The output will be looking like this:
:1: :2: :3: :4:
0 A B C G
1 D E F J
2 H None I None

Related

Make a new dataframe from multiple dataframes

Suppose I have 3 dataframes that are wrapped in a list. The dataframes are:
df_1 = pd.DataFrame({'text':['a','b','c','d','e'],'num':[2,1,3,4,3]})
df_2 = pd.DataFrame({'text':['f','g','h','i','j'],'num':[1,2,3,4,3]})
df_3 = pd.DataFrame({'text':['k','l','m','n','o'],'num':[6,5,3,1,2]})
The list of the dfs is:
df_list = [df_1, df_2, df_3]
Now I want to make a for loop such that goes on df_list, and for each df takes the text column and merge them on a new dataframe with a new column head called topic. Now since each text column is different from each dataframe I want to populate the headers as topic_1, topic_2, etc. The desired outcome should be as follow:
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o
I can easily extract the text columns as:
lst = []
for i in range(len(df_list)):
lst.append(df_list[i]['text'].tolist())
It is just that I am stuck on the last part, namely bringing the columns into 1 df without using brute force.

You can extract the wanted columns with a list comprehension and concat them:
pd.concat([d['text'].rename(f'topic_{i}')
for i,d in enumerate(df_list, start=1)],
axis=1)
output:
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o

Generally speaking you want to avoid looping anything on a pandas DataFrame. However, in this solution I do use a loop to rename your columns. This should work assuming you just have these 3 dataframes:
import pandas as pd
df_1 = pd.DataFrame({'text':['a','b','c','d','e'],'num':[2,1,3,4,3]})
df_2 = pd.DataFrame({'text':['f','g','h','i','j'],'num':[1,2,3,4,3]})
df_3 = pd.DataFrame({'text':['k','l','m','n','o'],'num':[6,5,3,1,2]})
df_list = [df_1.text, df_2.text, df_3.text]
df_combined = pd.concat(df_list,axis=1)
df_combined.columns = [f"topic_{i+1}" for i in range(len(df_combined.columns))]
>>> df_combined
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o

How to replace data in one pandas df by the data of another one?

Want to replace some rows of some columns in a bigger pandas df by data in a smaller pandas df. The column names are same in both.
Tried using combine_first but it only updates the null values.
For example lets say df1.shape is 100, 25 and df2.shape is 10,5
df1
A B C D E F G ...Z Y Z
1 abc 10.20 0 pd.NaT
df2
A B C D E
1 abc 15.20 1 10
Now after replacing df1 should look like:
A B C D E F G ...Z Y Z
1 abc 15.20 1 10 ...
To replace values in df1 the condition is where df1.A = df2.A and df1.B = df2.B
How can it be achieved in the most pythonic way? Any help will be appreciated.

Don't know I really understood your question does this solves your problem ?
df1 = pd.DataFrame(data={'A':[1],'B':[2],'C':[3],'D':[4]})
df2 = pd.DataFrame(data={'A':[1],'B':[2],'C':[5],'D':[6]})
new_df=pd.concat([df1,df2]).drop_duplicates(['A','B'],keep='last')
print(new_df)
output:
A B C D
0 1 2 5 6

You could play with Multiindex.
First let us create those dataframe that you are working with:
cols = pd.Index(list(ascii_uppercase))
vals = np.arange(100*len(cols)).reshape(100, len(cols))
df = pd.DataFrame(vals, columns=cols)
df1 = pd.DataFrame(vals[:10,:5], columns=cols[:5])
Then transform A and B in indices:
df = df.set_index(["A","B"])
df1 = df1.set_index(["A","B"])*1.5 # multiply just to make the other values different
df.loc[df1.index, df1.columns] = df1
df = df.reset_index()

Renaming columns on slice of dataframe not performing as expected

I was trying to clean up column names in a dataframe but only a part of the columns.
It doesn't work when trying to replace column names on a slice of the dataframe somehow, why is that?
Lets say we have the following dataframe:
Note, on the bottom is copy-able code to reproduce the data:
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concat the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value':[1,2,3,4],
'ColAfjkj':['a', 'b', 'c', 'd'],
'ColBhuqwa':['e', 'f', 'g', 'h'],
'ColCouiqw':['i', 'j', 'k', 'l']})

This is because pandas' index is immutable. If you check the documentation for class pandas.Index, you'll see that it is defined as:
Immutable ndarray implementing an ordered, sliceable set
So in order to modify it you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))

To overwrite columns names you can .rename() method:
So, it will look like:
df.rename(columns={'ColA_fjkj':'ColA',
'ColB_huqwa':'ColB',
'ColC_ouiqw':'ColC'}
, inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:,1:4].columns
Then, use list comprehension and a conditional to rename just the columns you want
df.columns = [x if x not in mask else str[:4] for x in df.columns]

ffill not filling data in pandas dataframe

I have a dataframe like this :
A B C E D
---------------
0 a r g g
1 x
2 x f f r
3 t
3 y
I am trying for forward filling using ffill. It is not working
cols = df.columns[:4].tolist()
df[cols] = df[cols].ffill()
I also tried :
df[cols] = df[cols].fillna(method='ffill')
But it is not getting filled.
Is it the empty columns in data causing this issue?
Data is mocked. Exact data is different (contains strings,numbers and empty columns)
desired o/p:
A B C E D
---------------
0 a r g g
1 a r g x
2 x f f r
3 x f f t
3 x f f y

Replace empty values in subset of columns by NaN:
df[cols] = df[cols].replace('', np.nan).ffill()

You should replace the empty strings with np.NaN before:
df = df.replace('', np.NaN)
df[cols] = df[cols].ffill()

Replace '' with np.nan first:
df[df='']=np.nan
df[cols] = df[cols].ffill()

Add calculated column to pandas dataframe on the fly while iterating over the lines of a csv file?

I have a large space separated input file input.csv, which I can't hold in memory:
## Header
# More header here
A B
1 2
3 4
If I use the iterator=True argument for pandas.read_csv, then it returns a TextFileReader / TextParser object. This allows filtering the file on the fly and only selecting rows for which column A is greater than 2.
But how do I add a third column to the dataframe on the fly without having to loop over all of the data once more?
Specifically I want column C to be equal to column A multiplied by the value in a dictionary d, which has the value of column B as its key; i.e. C = A*d[B].
Currently I have this code:
import pandas
d = {2: 2, 4: 3}
TextParser = pandas.read_csv('input.csv', sep=' ', iterator=True, comment='#')
df = pandas.concat([chunk[chunk['A'] > 2] for chunk in TextParser])
print(df)
Which prints this output:
A B
1 3 4
How do I get it to print this output (C = A*d[B]):
A B C
1 3 4 9

You can use a generator to work on the chunks one at a time:
Code:
def on_the_fly(the_csv):
d = {2: 2, 4: 3}
chunked_csv = pd.read_csv(
the_csv, sep='\s+', iterator=True, comment='#')
for chunk in chunked_csv:
rows_idx = chunk['A'] > 2
chunk.loc[rows_idx, 'C'] = chunk[rows_idx].apply(
lambda x: x.A * d[x.B], axis=1)
yield chunk[rows_idx]
Test Code:
from io import StringIO
data = StringIO(u"""#
A B
1 2
3 4
4 4
""")
import pandas as pd
df = pd.concat([c for c in on_the_fly(data)])
print(df)
Results:
A B C
1 3 4 9.0
2 4 4 12.0

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: Pivot Table/group by specific conditions - python

Related

Make a new dataframe from multiple dataframes

How to replace data in one pandas df by the data of another one?

Renaming columns on slice of dataframe not performing as expected

ffill not filling data in pandas dataframe

Add calculated column to pandas dataframe on the fly while iterating over the lines of a csv file?

Categories

Resources