I am trying to split a dataframe column into multiple columns, as follows:
There are three columns overall. Two should remain unchanged in the new dataframe, while the third is to be split into new columns.
Split is to be done using a specific character (say ":")
The column that requires splitting can contain a varying number of ":" separators, so the number of new columns can differ between rows, leaving some column values as NULL for some rows. That is okay.
Each subsequently formed column has a specific name. Max number of columns that can be formed is known.
There are four dataframes. Each one has this same formatted column that has to be split.
I came across the following solutions, but they don't work for the reasons mentioned:
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
This creates columns with names as 0,1,2... I need the new columns to have specific names.
df = df.apply(lambda x:pd.Series(x))
This does no change to the dataframe. Couldn't understand why.
df['command'], df['value'] = df[0].str.split().str
Here the column names are set properly, but this requires knowing beforehand how many columns will be formed. In my case, it is dynamic for each dataframe. For rows with fewer parts, the split correctly puts NULL values in the extra columns, but using the same code on another dataframe raises an error saying the number of keys must be the same.
I couldn't post comments on these answers as I am new here on this community. I would appreciate if someone can help me understand how I can achieve my objective - which is: Dynamically use same code to split one column into many for different dataframes on multiple occasions while renaming the newly generated columns to predefined name.
For example:
Dataframe 1:
Col1 Col2 Col3
0 A A:B:C A
1 A A:B:C:D:E A
2 A A:B A
Dataframe 2:
Col1 Col2 Col3
0 A A:B:C A
1 A A:B:C:D A
2 A A:B A
Output should be:
New dataframe 1:
Col1 ColA ColB ColC ColD ColE Col3
0 A A B C NaN NaN A
1 A A B C D E A
2 A A B NaN NaN NaN A
New dataframe 2:
Col1 ColA ColB ColC ColD ColE Col3
0 A A B C NaN NaN A
1 A A B C D NaN A
2 A A B NaN NaN NaN A
(If ColE is not there, then also it is fine.)
After this, I will be concatenating these dataframes into one, where I will need counts of all ColA to ColE for individual dataframes against Col1 and Col3 combinations. So, we need to keep this in mind.
You can do it this way:
columns = df.Col2.max().split(':')
# ['A', 'B', 'C', 'D', 'E']
# note: max() on strings compares lexicographically; here that happens to pick
# the row with the most parts, but that is not guaranteed for arbitrary data
new = df.Col2.str.split(":", expand=True)
new.columns = columns
new = new.add_prefix("Col")
df.join(new).drop(columns="Col2")
# Col1 Col3 ColA ColB ColC ColD ColE
#0 A A A B C None None
#1 A A A B C D E
#2 A A A B None None None
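Since the same operation has to be applied to four dataframes with a known maximum number of columns, it can be wrapped in a small helper. A sketch, using the predefined names `ColA`..`ColE` and the ":" separator from the example; `reindex` pads (or truncates) the split result so every dataframe ends up with the same columns regardless of how many parts its rows have:

```python
import pandas as pd

NEW_COLS = ["ColA", "ColB", "ColC", "ColD", "ColE"]  # predefined names, max known

def split_col(df, col="Col2", sep=":"):
    # split into as many columns as the longest row needs
    parts = df[col].str.split(sep, expand=True)
    # pad/truncate to the predefined number of columns; missing parts become NaN
    parts = parts.reindex(columns=range(len(NEW_COLS)))
    parts.columns = NEW_COLS
    return df.drop(columns=col).join(parts)

df1 = pd.DataFrame({"Col1": ["A", "A", "A"],
                    "Col2": ["A:B:C", "A:B:C:D:E", "A:B"],
                    "Col3": ["A", "A", "A"]})
print(split_col(df1))
```

The same `split_col` call can then be reused unchanged on each of the four dataframes before concatenating them.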
I have a data frame, something like this:
Id  Col1  Col2  Paired_Id
1   a           A
2   c           B
A         b     1
B         d     2
I would like to merge the rows to get the output something like this. Delete the paired row after merging.
Id  Col1  Col2  Paired_Id
1   a     b     A
2   c     d     B
Any hint?
So:
Merging rows (ID) with its Paired_ID entries.
Is this possible with Pandas?
Assuming NaNs in the empty cells, I would use a groupby.first with a frozenset of the two IDs as grouper:
group = df[['Id', 'Paired_Id']].apply(frozenset, axis=1)
out = df.groupby(group, as_index=False).first()
Output:
Id Col1 Col2 Paired_Id
0 1 a b A
1 2 c d B
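A self-contained version of the above on the sample data (assuming NaN in the empty cells); `sort=False` is added here because frozensets don't have a total order, so letting groupby sort the keys is unreliable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id":        ["1", "2", "A", "B"],
    "Col1":      ["a", "c", np.nan, np.nan],
    "Col2":      [np.nan, np.nan, "b", "d"],
    "Paired_Id": ["A", "B", "1", "2"],
})

# frozenset({Id, Paired_Id}) is identical for a row and its pair,
# so it works as a group key
group = df[["Id", "Paired_Id"]].apply(frozenset, axis=1)

# first() picks the first non-NaN value per column within each pair
out = df.groupby(group, as_index=False, sort=False).first()
print(out)
```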
Don't have a lot of information about the structure of your dataframe, so I will just assume a few things - please correct me if I'm wrong:
A line with an entry in Col1 will never have an entry in Col2.
Corresponding lines appear in the same sequence (lines 1,2,3... then corresponding lines 1,2,3...)
Every line has a corresponding second line later on in the dataframe
If all those assumptions are correct, you could split your data into two dataframes, df_upperhalf containing the Col1, df_lowerhalf the Col2.
df_upperhalf = df.iloc[:len(df.index)//2]
df_lowerhalf = df.iloc[-(len(df.index)//2):]
Then you can easily combine those values:
df_combined = df_upperhalf.copy()
df_combined['Col2'] = df_lowerhalf['Col2'].values  # .values ignores the mismatched indices
If some of my assumptions are incorrect, this will of course not produce the results you want.
There are also quite a few ways to do it in fewer lines of code, but I think this way you end up with nicer dataframes and the code should be easily readable.
Edit:
I think this would be quite a bit faster:
df_upperhalf = df.head(len(df.index)//2)
df_lowerhalf = df.tail(len(df.index)//2)
Suppose I have a pandas DataFrame like this
name col1 col2 col3
0 AAA 1 0 2
1 BBB 2 1 2
2 CCC 0 0 2
I want (a) the names of any columns that contain a value of 2 anywhere in the column (i.e., col1, col3), and (b) the names of any columns that contain only values of 2 (i.e., col3).
I understand how to use DataFrame.any() and DataFrame.all() to select rows in a DataFrame where a value appears in any or all columns, but I'm trying to find COLUMNS where a value appears in (a) any or (b) all rows.
You can do what you described with columns:
df.columns[df.eq(2).any()]
# Index(['col1', 'col3'], dtype='object')
df.columns[df.eq(2).all()]
# Index(['col3'], dtype='object')
You can also loop over the columns for (a):
for column in df.columns:
    if (df[column] == 2).any():
        print(column + " contains 2")
This prints the name of each column containing one or more 2.
And for (b):
for column in df.columns:
    if (df[column] == 2).all():
        print(column + " contains only 2")
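The same checks can be collected into lists of matching column names with comprehensions, which is handy if you want to use the names afterwards rather than print them. A sketch on the sample frame, restricted to the numeric columns:

```python
import pandas as pd

df = pd.DataFrame({"name": ["AAA", "BBB", "CCC"],
                   "col1": [1, 2, 0],
                   "col2": [0, 1, 0],
                   "col3": [2, 2, 2]})
num = df.drop(columns="name")  # restrict to the numeric columns

any_2 = [c for c in num.columns if (num[c] == 2).any()]  # (a)
all_2 = [c for c in num.columns if (num[c] == 2).all()]  # (b)
print(any_2, all_2)
```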
I have to slice my Dataframe according to values (imported from a txt) that occur in one of my Dataframe' s column. This is what I have:
>df
col1 col2
a 1
b 2
c 3
d 4
>'mytxt.txt'
2
3
This is what I need: drop rows whenever value in col2 is not among values in mytxt.txt
Expected result must be:
>df
col1 col2
b 2
c 3
I tried:
values = pd.read_csv('mytxt.txt', header=None)
df = df.col2.isin(values)
But it doesn't work. Help would be very appreciated, thanks!
When you read values, I would do it as a Series, and then convert it to a set, which will be more efficient for lookups:
values = pd.read_csv('mytxt.txt', header=None)[0]  # [0] selects the single column as a Series
values = set(values.tolist())
Then slicing will work:
>>> df[df.col2.isin(values)]
col1 col2
1 b 2
2 c 3
What was happening is you were reading values in as a DataFrame rather than a Series, so the .isin method was not behaving as you expected.
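Putting it together on the sample data, with `io.StringIO` standing in for `mytxt.txt` so the example is self-contained:

```python
import io
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "c", "d"],
                   "col2": [1, 2, 3, 4]})

txt = io.StringIO("2\n3\n")  # stands in for the contents of 'mytxt.txt'
values = set(pd.read_csv(txt, header=None)[0].tolist())  # {2, 3}

filtered = df[df.col2.isin(values)]  # keep rows whose col2 is in the file
print(filtered)
```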
In my code the df.fillna() method is not working, while the df.dropna() method works. I don't want to drop the column, though. What can I do so that the fillna() method works?
def preprocess_df(df):
    for col in df.columns:  # go through all of the columns
        if col != "target":  # normalize all ... except for the target itself!
            df[col] = df[col].pct_change()  # pct_change "normalizes" the different currencies (each crypto coin has vastly different values; we're really more interested in the other coins' movements)
            # df.dropna(inplace=True)  # remove the NaNs created by pct_change
            df.fillna(method="ffill", inplace=True)
            print(df)
            break
            df[col] = preprocessing.scale(df[col].values)  # scale between 0 and 1
It should work, unless it is not within the loop as mentioned.
You should consider filling it before you construct a loop or during the DataFrame construction:
The example below clearly shows it working:
>>> df
col1
0 one
1 NaN
2 two
3 NaN
Works as expected:
>>> df['col1'].fillna(method='ffill')  # column-specific fill for `col1`
0 one
1 one
2 two
3 two
Name: col1, dtype: object
Secondly, if you wish to change only a few selected columns, use the method below.
Let's suppose you have 3 columns and want to fillna with ffill for only 2 of them.
>>> df
col1 col2 col3
0 one test new
1 NaN NaN NaN
2 two rest NaN
3 NaN NaN NaN
Define the columns to be changed:
cols = ['col1', 'col2']
>>> df[cols] = df[cols].fillna(method='ffill')
>>> df
col1 col2 col3
0 one test new
1 one test NaN
2 two rest NaN
3 two rest NaN
If you want it to happen across the entire DataFrame, use it as follows:
>>> df
col1 col2
0 one test
1 NaN NaN
2 two rest
3 NaN NaN
>>> df.fillna(method='ffill')  # pass inplace=True if you want the change to be permanent
col1 col2
0 one test
1 one test
2 two rest
3 two rest
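Note that in recent pandas, `fillna(method=...)` is deprecated in favour of the dedicated `ffill`/`bfill` methods; a minimal sketch of both, including the case where the very first value is NaN (which `ffill` cannot fill):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["one", np.nan, "two", np.nan]})
filled = df.ffill()  # forward-fill: propagate the last valid value downward
print(filled)

# when the first value is NaN, ffill leaves it as NaN; bfill fills it from below
df2 = pd.DataFrame({"col1": [np.nan, "one", np.nan, "two"]})
back = df2.bfill()
print(back)
```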
The first value was a NaN, so I had to use the bfill method instead. Thanks everyone!
Very new to python and using pandas, I only use it every once in a while when I'm trying to learn and automate otherwise a tedious Excel task. I've come upon a problem where I haven't exactly been able to find what I'm looking for through Google or here on Stack Overflow.
I currently have 6 different Excel (.xlsx) files that I am able to parse and read into dataframes. However, whenever I try to append them together, they're simply added on as new rows in the final output Excel file. Instead, I'm trying to append similar data values onto the same row, and not the same column, so that I can see whether or not a given unique value shows up in each data set. A shortened example is as follows:
[df1]
0 Col1 Col2
1 XYZ 41235
2 OAIS 15123
3 ABC 48938
[df2]
0 Col1 Col2
1 KFJ 21493
2 XYZ 43782
3 SHIZ 31299
4 ABC 33347
[Expected Output]
0 Col1 [df1] [df2]
1 XYZ 41235 43782
2 OAIS 15123
3 ABC 48938 33347
4 KFJ 21493
5 SHIZ 31299
I've tried to use a merge; however, the actual data sheets are much more complicated, in that I want to append 23 columns of data associated with each unique identifier in each data set. For example, [XYZ] in [df2] has associated information across the next 23 columns, which I would want to append after the 23 columns from the [XYZ] value in [df1].
How should I go about that? There are approximately 200 rows in each excel sheet and I would only need to essentially loop through until a matching unique identifier was found in [df2] with [df1], and then [df3] with [df1] and so on until [df6] and append those columns onto a new dataframe which would eventually be output as a new excel file.
df1 = pd.read_excel("set1.xlsx")
df2 = pd.read_excel("set2.xlsx")
df3 = pd.read_excel("set3.xlsx")
df4 = pd.read_excel("set4.xlsx")
df5 = pd.read_excel("set5.xlsx")
df6 = pd.read_excel("set6.xlsx")
is currently the way I am reading the Excel files into dataframes. I'm sure I could loop it; however, I am unsure of the best practice for doing so rather than hard-coding each dataframe initialization.
You need merge with the parameter how = 'outer'
new_df = df1.merge(df2, on='Col1', how='outer', suffixes=('_df1', '_df2'))
You get
Col1 Col2_df1 Col2_df2
0 XYZ 41235.0 43782.0
1 OAIS 15123.0 NaN
2 ABC 48938.0 33347.0
3 KFJ NaN 21493.0
4 SHIZ NaN 31299.0
For iterative merging, consider storing the dataframes in a list and then running a chained merge with reduce(). The code below creates the list via a list comprehension over the Excel files, where enumerate() is used to rename Col2 successively to df1, df2, etc.
from functools import reduce
...
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df'+str(i)})
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]
df = reduce(lambda x,y: pd.merge(x, y, on=['Col1'], how='outer'), dfList)
# Col1 df1 df2
# 0 XYZ 41235.0 43782.0
# 1 OAIS 15123.0 NaN
# 2 ABC 48938.0 33347.0
# 3 KFJ NaN 21493.0
# 4 SHIZ NaN 31299.0
Alternatively, use pd.concat and outer join the dataframes horizontally where you need to set Col1 as index:
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df'+str(i)}).set_index('Col1')
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]
df2 = pd.concat(dfList, axis=1, join='outer', copy=False)\
.reset_index().rename(columns={'index':'Col1'})
# Col1 df1 df2
# 0 ABC 48938.0 33347.0
# 1 KFJ NaN 21493.0
# 2 OAIS 15123.0 NaN
# 3 SHIZ NaN 31299.0
# 4 XYZ 41235.0 43782.0
You can use the merge function.
pd.merge(df1, df2, on=['Col1'])
You can use multiple keys by adding to the list on.
You can read more about the merge function in the pandas documentation.
If you need only certain of the columns, select them (including the merge key) before merging:
df1.merge(df2[['Col1', 'Col2']], on=['Col1'])
EDIT:
In case of looping through some df's you can loop through all df's except the first and merge them all:
df_list = [df2, df3, df4]
for df in df_list:
    df1 = df1.merge(df[['Col1', 'Col2']], on=['Col1'])
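A runnable sketch of this looped merge on small in-memory frames (the third frame's values are made up for illustration). Each `Col2` is renamed before merging, as in the reduce() answer above, so repeated merges never collide on the same column name, and `how='outer'` keeps identifiers that appear in only some of the frames:

```python
import pandas as pd

df1 = pd.DataFrame({"Col1": ["XYZ", "OAIS", "ABC"],
                    "Col2": [41235, 15123, 48938]})
df2 = pd.DataFrame({"Col1": ["KFJ", "XYZ", "SHIZ", "ABC"],
                    "Col2": [21493, 43782, 31299, 33347]})
df3 = pd.DataFrame({"Col1": ["XYZ", "ABC"], "Col2": [7, 8]})

# rename each Col2 before merging so successive merges don't clash on the name
merged = df1.rename(columns={"Col2": "df1"})
for i, df in enumerate([df2, df3], 2):
    merged = merged.merge(df.rename(columns={"Col2": "df" + str(i)}),
                          on="Col1", how="outer")
print(merged)
```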