date red,heavy,new blue,light,old
1-2-20 320 120
2-3-20 220 125
I want to iterate through all rows and columns so that I can parse the column names and use them as values for new columns. I want to get data in this format:
I want the dates to be repeated. The 'value' column comes from the original table.
date    color  weight  condition  value
1-2-20  red    heavy   new        320
1-2-20  blue   light   old        120
2-3-20  red    heavy   new        220
I tried this, and it worked when I only had one column:
colName = df_retransform.columns[1]
lst = colName.split(",")
color = lst[0]
weight = lst[1]
condition = lst[2]
df_retransform.rename(columns={colName: 'value'}, inplace=True)
df_retransform['color'] = color
df_retransform['weight'] = weight
df_retransform['condition'] = condition
but I am unable to modify it so that it works for all columns.
Use DataFrame.melt with Series.str.split; DataFrame.pop both uses and drops the variable column, and finally change the order of the column names if necessary.
First, you can check that every column other than date contains exactly 2 commas:
print ([col for col in df.columns if col.count(',') != 2])
['date']
df = df.melt('date')
df[['color', 'weight', 'condition']] = df.pop('variable').str.split(',', expand=True)
df = df[['date', 'color', 'weight', 'condition', 'value']]
print (df)
date color weight condition value
0 1-2-20 red heavy new 320
1 2-3-20 red heavy new 220
2 1-2-20 blue light old 120
3 2-3-20 blue light old 125
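If you also want the rows grouped by date, as in the desired output in the question, one option is a stable sort on the parsed dates; a minimal sketch, assuming the dates are day-month-year strings:
# my addition, not part of the answer above: parse the dates so the sort is
# reliable, and use a stable sort so column order is preserved within each date
df = df.sort_values('date', key=lambda s: pd.to_datetime(s, format='%d-%m-%y'),
                    kind='stable', ignore_index=True)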
Or use DataFrame.stack to get a MultiIndex Series, then split the index values and recreate all the levels for the new columns:
print (df)
date red,heavy,new blue,light,old
0 1-2-20 320 NaN
1 NaN 220 125.0
s = df.set_index('date').stack(dropna=False)
s.index = pd.MultiIndex.from_tuples([(i, *j.split(',')) for i, j in s.index],
                                    names=['date', 'color', 'weight', 'condition'])
df = s.reset_index(name='value')
print (df)
date color weight condition value
0 1-2-20 red heavy new 320.0
1 1-2-20 blue light old NaN
2 NaN red heavy new 220.0
3 NaN blue light old 125.0
You could also use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.pivot_longer(index="date",
                names_to=("color", "weight", "condition"),
                names_sep=",")
date color weight condition value
0 1-2-20 red heavy new 320
1 2-3-20 red heavy new 220
2 1-2-20 blue light old 120
3 2-3-20 blue light old 125
You pass the names of the new columns to names_to, and specify the separator (,) in names_sep.
If you want the rows returned in order of appearance, you can pass True to the sort_by_appearance argument:
df.pivot_longer(
    index="date",
    names_to=("color", "weight", "condition"),
    names_sep=",",
    sort_by_appearance=True,
)
date color weight condition value
0 1-2-20 red heavy new 320
1 1-2-20 blue light old 120
2 2-3-20 red heavy new 220
3 2-3-20 blue light old 125
Related
I have data for many countries over a period of time (2001-2003). It looks something like this:
index  year  country  inflation  GDP
1      2001  AFG      nan        48
2      2002  AFG      nan        49
3      2003  AFG      nan        50
4      2001  CHI      3.0        nan
5      2002  CHI      5.0        nan
6      2003  CHI      7.0        nan
7      2001  USA      nan        220
8      2002  USA      4.0        250
9      2003  USA      2.5        280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (it is missing all inflation values) and CHI (all GDP values are missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work: it filters out every country whose values are all nan in either inflation or GDP (on the sample above, only USA survives, since AFG is missing all inflation data and CHI all GDP data):
(
    df.groupby(['country'])
      .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
Note, if you have more than two columns, you can use a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
If you want the check to use a specific range of years instead of all of them, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
You can also try this:
# where the sum equals 0, the column has no values for that country
group_by = df.groupby(['country']).agg({'inflation': sum, 'GDP': sum}).reset_index()
# extract only countries with information in both columns
indexes = group_by[(group_by['GDP'] != 0) & (group_by['inflation'] != 0)].index
final_countries = list(group_by.loc[group_by.index.isin(indexes), 'country'])
# keep only the rows for those countries
df = df.drop(df[~df.country.isin(final_countries)].index)
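One caveat with the zero-sum check, noted by me rather than by the answer: a column whose real values happen to sum to zero would be treated as empty. Counting non-null entries avoids this; a minimal sketch:
# count non-null entries per country instead of summing the values
counts = df.groupby('country').agg({'inflation': 'count', 'GDP': 'count'}).reset_index()
# keep countries that have at least one value in both columns
keep = counts[(counts['inflation'] > 0) & (counts['GDP'] > 0)]['country']
df = df[df['country'].isin(keep)]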
You could reshape the data frame from long to wide, drop nulls, and then convert back to long.
To convert from long to wide, you can use pivot functions. See this question too.
Here's code for dropping nulls, after it's reshaped:
df.dropna(axis=0, how= 'any', thresh=None, subset=None, inplace=True) # Delete rows, where any value is null
To convert back to long, you can use pd.melt.
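Putting the three steps together, a minimal sketch under the sample data above (the wide/melted variable names and the pivot layout are my own choices). Note that how='any' also drops a country that is missing just one year, which is stricter than the question asks for:
# long -> wide: one row per country, one column per (variable, year) pair
wide = df.pivot(index='country', columns='year', values=['inflation', 'GDP'])
# delete rows (countries) where any value is null
wide = wide.dropna(axis=0, how='any')
# wide -> long: flatten the MultiIndex columns, then melt back
wide.columns = [f'{var}_{year}' for var, year in wide.columns]
melted = wide.reset_index().melt(id_vars='country')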
I have a CSV file with a column that contains a string with vehicle options.
Brand Options
Toyota Color:Black,Wheels:18
Toyota Color:Black,
Chevy Color:Red,Wheels:16,Style:18
Honda Color:Green,Personalization:"Customer requested detailing"
Chevy Color:Black,Wheels:16
I want to expand the "Options" string to new columns with the appropriate names. The dataset is considerably large, so I am trying to name the columns programmatically (i.e. Color, Wheels, Personalization) and then apply the respective value to each row, or a null value.
Adding new data
import pandas as pd
Cars = pd.read_csv("Cars.CSV") # Loads cars into df
split = Cars["Options"].str.split(",", expand = True) # Data in form of {"Color:Black", "Wheels:16"}
split[0][0].split(":") # returns ['Color', 'Black']
What is an elegant way to concat these lists to the original dataframe Cars without specifying the columns manually?
You can prepare for a clean split by first using rstrip to avoid a null column, since you have one row with a comma at the end. Then, after splitting, explode to multiple rows and split again by :, this time using expand=True. Then, pivot the dataset into the desired format and concat back to the original dataframe:
pd.concat([df,
           df['Options'].str.rstrip(',')
                        .str.split(',')
                        .explode()
                        .str.split(':', expand=True)
                        .pivot(values=1, columns=0)],
          axis=1).drop('Options', axis=1)
Out[1]:
Brand Color Personalization Style Wheels
0 Toyota Black NaN NaN 18
1 Toyota Black NaN NaN NaN
2 Chevy Red NaN 18 16
3 Honda Green "Customer requested detailing" NaN NaN
4 Chevy Black NaN NaN 16
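To see why the final pivot lines up correctly, it helps to look at the intermediate frame after explode and the second split (my own breakdown, not part of the answer). Each exploded row keeps its original row index, so pivot(values=1, columns=0) can regroup the option values under their names, aligned by that index:
df['Options'].str.rstrip(',').str.split(',').explode().str.split(':', expand=True)
#         0      1
# 0   Color  Black
# 0  Wheels     18
# 1   Color  Black
# 2   Color    Red
# ...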
Following on from this question, is it possible to perform a similar 'broaden' operation in pandas where there are multiple source columns per 'entity'?
If my data now looks like:
Box,Code,Category
Green,1221,Active
Green,8391,Inactive
Red,3709,Inactive
Red,2911,Pending
Blue,9820,Active
Blue,4530,Active
How do I most efficiently get to:
Box,Code0,Category0,Code1,Category1
Green,1221,Active,8391,Inactive
Red,3709,Inactive,2911,Pending
Blue,9820,Active,4530,Active
So far, the only solution I have been able to put together that 'works' is to follow the example from the linked page: create two separate DataFrames, one grouped by Box and Code, the other grouped by Box and Category, and then merge the two together on Box.
a = get_clip.groupby('Box')['Code'].apply(list)
b = get_clip.groupby('Box')['Category'].apply(list)
broadeneda = pd.DataFrame(a.values.tolist(), index = a.index).add_prefix('Code').reset_index()
broadenedb = pd.DataFrame(b.values.tolist(), index = b.index).add_prefix('Category').reset_index()
merged = pd.merge(broadeneda, broadenedb, on='Box', how = 'inner')
Is there a way to achieve this without broadening each column separately and merging at the end?
groupby + cumcount + unstack
df1=df.assign(n=df.groupby('Box').cumcount()).set_index(['Box','n']).unstack(1)
df1.columns=df1.columns.map('{0[0]}{0[1]}'.format)
df1
Out[141]:
Code0 Code1 Category0 Category1
Box
Blue 9820 4530 Active Active
Green 1221 8391 Active Inactive
Red 3709 2911 Inactive Pending
Option 1
Using set_index, pipe, and set_axis
df.set_index(['Box', df.groupby('Box').cumcount()]).unstack().pipe(
    lambda d: d.set_axis(d.columns.map('{0[0]}{0[1]}'.format), 1, False)
)
Code0 Code1 Category0 Category1
Box
Blue 9820 4530 Active Active
Green 1221 8391 Active Inactive
Red 3709 2911 Inactive Pending
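A side note from me, not the original answer: recent pandas versions removed the positional axis and inplace arguments of set_axis, so on current pandas the pipe step needs keywords:
# same reshape; set_axis called with keyword arguments for recent pandas
df.set_index(['Box', df.groupby('Box').cumcount()]).unstack().pipe(
    lambda d: d.set_axis(d.columns.map('{0[0]}{0[1]}'.format), axis=1)
)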
Option 2
Using defaultdict
from collections import defaultdict

d = defaultdict(dict)
for a, *b in df.values:
    i = len(d[a]) // len(b)           # pairs already stored for this Box
    c = (f'Code{i}', f'Category{i}')  # numbered column names for this pair
    d[a].update(dict(zip(c, b)))
pd.DataFrame.from_dict(d, 'index').rename_axis('Box')
Code0 Category0 Code1 Category1
Box
Blue 9820 Active 4530 Active
Green 1221 Active 8391 Inactive
Red 3709 Inactive 2911 Pending
This can be done by iterating over sub-dataframes:
cols = ["Box","Code0","Category0","Code1","Category1"]
newdf = pd.DataFrame(columns = cols) # create an empty dataframe to be filled
for box in pd.unique(df.Box): # for each color in Box
subdf = df[df.Box == box] # get a sub-dataframe
newrow = subdf.values[0].tolist() # get its values and then its full first row
newrow.extend(subdf.values[1].tolist()[1:3]) # add second and third entries of second row
newdf = pd.concat([newdf, pd.DataFrame(data=[newrow], columns=cols)], axis=0) # add to new dataframe
print(newdf)
Output:
Box Code0 Category0 Code1 Category1
0 Green 1221.0 Active 8391.0 Inactive
0 Red 3709.0 Inactive 2911.0 Pending
0 Blue 9820.0 Active 4530.0 Active
It seems that rows with the same color appear consecutively and that each color has the same number of rows (two important assumptions; here, exactly two per color). Thus, we can split the df into the odd part, df[::2], and the even part, df[1::2], and then merge them together.
pd.merge(df[::2], df[1::2], on="Box")
Box Code_x Category_x Code_y Category_y
0 Green 1221 Active 8391 Inactive
1 Red 3709 Inactive 2911 Pending
2 Blue 9820 Active 4530 Active
One can rename the columns easily afterwards, as sketched below.
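A minimal sketch of that renaming (the merged name is my own):
merged = pd.merge(df[::2], df[1::2], on="Box")
# replace the automatic _x/_y merge suffixes with the numbered names from the question
merged.columns = ['Box', 'Code0', 'Category0', 'Code1', 'Category1']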
I hope the title is accurate enough; I wasn't quite sure how to phrase it.
Anyhow, my problem is that I have a Pandas df which looks like the following:
Customer Source CustomerSource
0 Apple A 141
1 Apple B 36
2 Microsoft A 143
3 Oracle C 225
4 Sun C 151
This is a df derived from a larger dataset, and the meaning of the CustomerSource value is that it is the accumulated sum of all occurrences of that Customer and Source; for example, in this case there are 141 occurrences of Apple with Source A and 225 of Customer Oracle with Source C, and so on.
What I want to do with this is a stacked barplot that shows all Customers on the x-axis and the values of CustomerSource stacked on top of each other on the y-axis, similar to the example below. Any hints as to how I would proceed with this?
You can use pivot or unstack to reshape, and then DataFrame.plot.bar:
df.pivot('Customer','Source','CustomerSource').plot.bar(stacked=True)
df.set_index(['Customer','Source'])['CustomerSource'].unstack().plot.bar(stacked=True)
Or, if there are duplicates in the Customer, Source pairs, use pivot_table or groupby with sum aggregation:
print (df)
Customer Source CustomerSource
0 Apple A 141 <-same Apple, A
1 Apple A 200 <-same Apple, A
2 Apple B 36
3 Microsoft A 143
4 Oracle C 225
5 Sun C 151
df = df.pivot_table(index='Customer',columns='Source',values='CustomerSource', aggfunc='sum')
print (df)
Source A B C
Customer
Apple 341.0 36.0 NaN <-141 + 200 = 341
Microsoft 143.0 NaN NaN
Oracle NaN NaN 225.0
Sun NaN NaN 151.0
df.pivot_table(index='Customer', columns='Source', values='CustomerSource',
               aggfunc='sum').plot.bar(stacked=True)
df.groupby(['Customer','Source'])['CustomerSource'].sum().unstack().plot.bar(stacked=True)
It is also possible to swap the axes by exchanging the first two arguments:
df.pivot('Customer','Source','CustomerSource').plot.bar(stacked=True)
df.pivot('Source', 'Customer','CustomerSource').plot.bar(stacked=True)
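A note of mine, not from the answer: in recent pandas versions DataFrame.pivot accepts keyword arguments only, so these calls become:
df.pivot(index='Customer', columns='Source', values='CustomerSource').plot.bar(stacked=True)
df.pivot(index='Source', columns='Customer', values='CustomerSource').plot.bar(stacked=True)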
I have data that looks like this:
Box,Code
Green,1221
Green,8391
Red,3709
Red,2911
Blue,9820
Blue,4530
Using a pandas dataframe, I'm wondering if it is possible to output something like this:
Box,Code1,Code2
Green,1221,8391
Red,3709,2911
Blue,9820,4530
My data always has an equal number of rows per 'Box'.
I've been experimenting with pivots and crosstabs (as well as stack and unstack) in pandas but haven't found anything that gets me to the 'broaden' result I'm looking for.
You can use groupby to collect lists, and then the DataFrame constructor:
a = df.groupby('Box')['Code'].apply(list)
df = pd.DataFrame(a.values.tolist(), index=a.index).add_prefix('Code').reset_index()
print (df)
Box Code0 Code1
0 Blue 9820 4530
1 Green 1221 8391
2 Red 3709 2911
Or use cumcount to build a new Series for pandas.pivot:
g = df.groupby('Box').cumcount()
df = pd.pivot(index=df['Box'], columns=g, values=df['Code']).add_prefix('Code').reset_index()
print (df)
Box Code0 Code1
0 Blue 9820 4530
1 Green 1221 8391
2 Red 3709 2911
And a similar solution with unstack:
df['g'] = df.groupby('Box').cumcount()
df = df.set_index(['Box', 'g'])['Code'].unstack().add_prefix('Code').reset_index()
print (df)
g Box Code0 Code1
0 Blue 9820 4530
1 Green 1221 8391
2 Red 3709 2911
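One last note of mine, not from the answer: after reset_index the helper name lingers as the columns name (visible as g in the printout above); it can be cleared directly:
df.columns.name = None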