Let's say I'm working on a dataset:
# dummy dataset
import pandas as pd
data = pd.DataFrame({"Name_id": ["John", "Deep", "Julia", "John", "Sandy", "Deep"],
                     "Month_id": ["December", "March", "May", "April", "May", "July"],
                     "Colour_id": ["Red", "Purple", "Green", "Black", "Yellow", "Orange"]})
data
How can I convert this data frame into something like this:
Where the Name_id is unique and forms new columns based on both the value and the presence or absence of the other columns, in order of appearance? I have tried to use pivot, but I noticed it's used more for numerical data than for categorical data.
You can try pivot after numbering each name's rows:
data['Rowid'] = data.groupby('Name_id').cumcount() + 1  # number each name's occurrences
d = data.pivot(index='Name_id', columns='Rowid', values=['Month_id', 'Colour_id'])
d.reset_index(inplace=True)
# the pivoted columns come grouped by value: Month_id 1, 2, then Colour_id 1, 2
d.columns = ['Name_id', 'Month_id1', 'Month_id2', 'Colour_id1', 'Colour_id2']
which gives
  Name_id Month_id1 Month_id2 Colour_id1 Colour_id2
0    Deep     March      July     Purple     Orange
1    John  December     April        Red      Black
2   Julia       May       NaN      Green        NaN
3   Sandy       May       NaN     Yellow        NaN
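If a name can appear more than twice, hardcoding the column list becomes brittle. Here is a more general sketch (reusing the dummy data frame from the question) that flattens the pivoted MultiIndex programmatically:
import pandas as pd

data = pd.DataFrame({"Name_id": ["John", "Deep", "Julia", "John", "Sandy", "Deep"],
                     "Month_id": ["December", "March", "May", "April", "May", "July"],
                     "Colour_id": ["Red", "Purple", "Green", "Black", "Yellow", "Orange"]})

data['Rowid'] = data.groupby('Name_id').cumcount() + 1
d = data.pivot(index='Name_id', columns='Rowid', values=['Month_id', 'Colour_id'])
# flatten the (value, Rowid) column pairs into names like 'Month_id1', 'Colour_id2'
d.columns = [f'{col}{row}' for col, row in d.columns]
d = d.reset_index()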
I am using the IMDB dataset for machine learning, and it contains a lot of missing values entered as '\N'. Specifically, in the startYear column, which contains the movie release year, I want to convert the values to integers, which I'm not able to do right now. I could drop these values, but I wanted to see why they're missing first. I tried several things but had no success.
This is my latest attempt:
Here is a way to do it without using replace:
import pandas as pd
import numpy as np
df_basics = pd.DataFrame({'startYear': ['\\N']*78760 + [2017]*18267 + [2018]*18263 + [2016]*17837 + [2019]*17769 + ['1996 ', '1993 ', '2000 ', '2019 ', '2029 ']})
print(df_basics['startYear'].value_counts())

# replace the '\N' markers with a proper missing value
df_basics.loc[df_basics['startYear'] == '\\N', 'startYear'] = np.nan
print(df_basics['startYear'].value_counts(dropna=False))
Output:
NaN 78760
2017 18267
2018 18263
2016 17837
2019 17769
1996 1
1993 1
2000 1
2019 1
2029 1
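To finish the conversion to integers, one option is pd.to_numeric followed by pandas' nullable 'Int64' dtype, since a plain int column cannot hold NaN (a sketch; the whitespace stripping matches the '1996 '-style entries above):
# strip stray whitespace, coerce anything unparseable to NaN,
# then use the nullable integer dtype ('Int64', capital I) which tolerates NaN
df_basics['startYear'] = pd.to_numeric(
    df_basics['startYear'].astype(str).str.strip(), errors='coerce'
).astype('Int64')
print(df_basics['startYear'].value_counts(dropna=False))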
I have data for many countries over a period of time (2001-2003). It looks something like this:
index  year  country  inflation  GDP
1      2001  AFG      nan        48
2      2002  AFG      nan        49
3      2003  AFG      nan        50
4      2001  CHI      3.0        nan
5      2002  CHI      5.0        nan
6      2003  CHI      7.0        nan
7      2001  USA      nan        220
8      2002  USA      4.0        250
9      2003  USA      2.5        280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (it is missing all inflation values) and CHI (all GDP values). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work; it filters out countries where every value is NaN in either inflation or GDP:
(
    df.groupby(['country'])
      .filter(lambda x: not x['inflation'].isnull().all()
                        and not x['GDP'].isnull().all())
)
Note: if you have more than two columns, a more general version checks all of them at once:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
If you want this to work on a specific range of years instead of all rows, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
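For reference, a self-contained version of the general filter with the sample data from the question (the construction of df here is mine):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'year':      [2001, 2002, 2003] * 3,
    'country':   ['AFG'] * 3 + ['CHI'] * 3 + ['USA'] * 3,
    'inflation': [np.nan, np.nan, np.nan, 3.0, 5.0, 7.0, np.nan, 4.0, 2.5],
    'GDP':       [48, 49, 50, np.nan, np.nan, np.nan, 220, 250, 280],
})

kept = df.groupby('country').filter(lambda x: not x.isnull().all().any())
print(kept)  # only the USA rows survive: AFG has no inflation data, CHI no GDP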
You can also try this:
# a per-country sum of 0 means that column has no values for that country
group_by = df.groupby(['country']).agg({'inflation': 'sum', 'GDP': 'sum'}).reset_index()
# keep only countries with information in both columns
final_countries = group_by.loc[(group_by['GDP'] != 0) & (group_by['inflation'] != 0), 'country'].tolist()
# keep only the rows for those countries
df = df[df['country'].isin(final_countries)]
You could reshape the data frame from long to wide, drop nulls, and then convert back to long.
To convert from long to wide, you can use pivot functions. See this question too.
Here's the code for dropping nulls after it's reshaped:
df.dropna(axis=0, how='any', inplace=True)  # delete rows where any value is null
To convert back to long, you can use pd.melt.
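Here is a rough sketch of that roundtrip with the question's column names, assuming the long frame is df. Note that a blanket dropna(how='any') on the wide frame would also drop USA for its single missing year, so this version checks each variable for at least one non-null year, and uses stack rather than melt since the years end up in the column index:
# wide: one row per country, one column per (variable, year) pair
wide = df.pivot(index='country', columns='year', values=['inflation', 'GDP'])

# keep a country only if each variable has at least one non-null year
keep = wide['inflation'].notna().any(axis=1) & wide['GDP'].notna().any(axis=1)

# back to long: stack the year level and restore the original layout
tidy = wide[keep].stack('year').reset_index()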
I have a CSV file with a column that contains a string with vehicle options.
Brand   Options
Toyota  Color:Black,Wheels:18
Toyota  Color:Black,
Chevy   Color:Red,Wheels:16,Style:18
Honda   Color:Green,Personalization:"Customer requested detailing"
Chevy   Color:Black,Wheels:16
I want to expand the "Options" string into new columns with the appropriate names. The dataset is considerably large, so I am trying to name the columns programmatically (i.e., Color, Wheels, Personalization) and then apply the respective value to each row, or a null value.
What I have so far:
import pandas as pd
Cars = pd.read_csv("Cars.CSV")  # load the cars into a dataframe
split = Cars["Options"].str.split(",", expand=True)  # each row becomes ["Color:Black", "Wheels:16", ...]
split[0][0].split(":")  # returns ['Color', 'Black']
What is an elegant way to concat these lists to the original dataframe Cars without specifying the columns by hand?
You can prepare for a clean split by first using rstrip to avoid a null column, since one row has a trailing comma. After splitting, explode to multiple rows and split again on :, this time with expand=True. Then pivot the result into the desired shape and concat it back to the original dataframe:
pd.concat([Cars,
           Cars['Options'].str.rstrip(',')
                          .str.split(',')
                          .explode()
                          .str.split(':', expand=True)
                          .pivot(values=1, columns=0)],
          axis=1).drop('Options', axis=1)
Out[1]:
Brand Color Personalization Style Wheels
0 Toyota Black NaN NaN 18
1 Toyota Black NaN NaN NaN
2 Chevy Red NaN 18 16
3 Honda Green "Customer requested detailing" NaN NaN
4 Chevy Black NaN NaN 16
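One possible hardening, not strictly needed for the sample data: if an option's value could itself contain a colon, limiting the split to the first colon keeps the value intact. A small variation on the same chain:
options = (Cars['Options'].str.rstrip(',')
                          .str.split(',')
                          .explode()
                          .str.split(':', n=1, expand=True)  # split on the first ':' only
                          .pivot(values=1, columns=0))
result = pd.concat([Cars, options], axis=1).drop('Options', axis=1)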
I'm having some difficulty converting the values from object to float.
I saw some examples but couldn't get any of them to work.
I would like a for loop that converts the values in all columns.
I don't have a script yet because I've seen different ways to do it.
Terawatt-hours  Total Asia Pacific  Total CIS    Total Europe
2000            0.428429            0            0.134473
2001            0.608465            0            0.170166
2002            0.829254            0            0.276783
2003            1.11654             0            0.468726
2004            1.46406             0            0.751126
2005            1.85281             0            1.48641
2006            2.29128             0            2.52412
2007            2.74858             0            3.81573
2008            3.3306              0            7.5011
2009            4.3835              7.375e-06    14.1928
2010            6.73875             0.000240125  23.2634
2011            12.1544             0.00182275   46.7135
I tried this:
df = pd.read_excel(r'bp-stats-review-2019-all-data.xls')
columns = list(df.head(0))
for i in range(len(columns)):
    df[columns[i]].astype(float)
Your question is not clear as to which column you are trying to convert, so I am sharing an example for the first column in your screenshot:
df['Terawatt-hours'] = df['Terawatt-hours'].astype(float)  # bracket notation: the hyphen in the name breaks df.Terawatt-hours
or the same for any other column.
EDIT
for creating a loop on the dataframe and change it for all the columns, you can do the following :
Generating a dummy dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list('abcd'))
Check the type of each column in the dataframe:
for column in df.columns:
    print(df[column].dtype)
Change the type of all the columns to float:
for column in df.columns:
    df[column] = df[column].astype(float)
Your question is not clear: which columns are you trying to convert to float? Also, please post what you have tried.
EDIT:
What you tried is right up until the last line of your code, where you failed to reassign the columns:
df[columns[i]] = df[columns[i]].astype(float)
Also try using df.columns to get the column names instead of list(df.head(0)).
See the pandas docs on astype for how to cast a pandas object to a specified dtype.
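If any column contains strings that can't be parsed as floats, astype will raise. A more forgiving sketch uses pd.to_numeric with errors='coerce', which turns unparseable entries into NaN instead:
for column in df.columns:
    df[column] = pd.to_numeric(df[column], errors='coerce')  # unparseable -> NaN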
I have a dataset that is similar in this format:
CITY - YEAR - ATTRIBUTE - VALUE
## example:
dallas-2002-crime-100
dallas-2003-crime-101
dallas-2002-population-4000
houston-2002-population-4100
etc....
I'm trying to reshape this from long to wide format so that each city+year combination is a row and all the distinct attributes are the column names.
Thus this new dataframe would look like:
###
city - year - population - crime - median_income- etc....
I've looked at the pivot function, but it doesn't seem to support a multi-index for reshaping. Can someone suggest a workaround? Additionally, I tried pd.pivot_table, but it seems to only work with numerical data and aggregations (sums, means, etc.). Most of my VALUE attributes are actually strings, so I don't seem to be able to use it.
### doesn't work - can't use a multindex
df.pivot(index=['city','year'], columns = 'attribute', values='value')
Thank you for your help!
Is this what you are looking for?
import pandas as pd
from io import StringIO
data = """city-year-attribute-value
dallas-2002-crime-100
dallas-2003-crime-101
dallas-2002-population-4000
houston-2002-population-4100"""
df = pd.read_csv(StringIO(data), sep="-")
pivoted = df.pivot_table(
    index=["city", "year"],
    columns=["attribute"],
    values=["value"]
)
print(pivoted.reset_index())
Result:
city year value
attribute crime population
0 dallas 2002 100.0 4000.0
1 dallas 2003 101.0 NaN
2 houston 2002 NaN 4100.0
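One note on the string-valued attributes you mention: pivot_table's default aggfunc is mean, which happens to work for this numeric sample but would silently drop non-numeric columns. Passing aggfunc='first' (assuming each city/year/attribute combination appears only once) keeps string values:
pivoted = df.pivot_table(
    index=["city", "year"],
    columns="attribute",
    values="value",
    aggfunc="first"  # keep strings; the default 'mean' only handles numerics
)
print(pivoted.reset_index())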