I'm having some difficulties converting values from object to float.
I saw some examples but I couldn't get them to work.
I would like to have a for loop that converts the values in all columns.
I don't have a script yet because I've seen several different ways to do it.
Terawatt-hours Total Asia Pacific Total CIS Total Europe
2000 0.428429 0 0.134473
2001 0.608465 0 0.170166
2002 0.829254 0 0.276783
2003 1.11654 0 0.468726
2004 1.46406 0 0.751126
2005 1.85281 0 1.48641
2006 2.29128 0 2.52412
2007 2.74858 0 3.81573
2008 3.3306 0 7.5011
2009 4.3835 7.375e-06 14.1928
2010 6.73875 0.000240125 23.2634
2011 12.1544 0.00182275 46.7135
I tried this:
df = pd.read_excel(r'bp-stats-review-2019-all-data.xls')
columns = list(df.head(0))
for i in range(len(columns)):
    df[columns[i]].astype(float)
Your question is not clear as to which column you are trying to convert, so I am sharing an example for the first column in your screenshot. (Note that attribute access like df.Terawatt-hours won't work here, because the hyphen is parsed as subtraction; use bracket access instead.)
df['Terawatt-hours'] = df['Terawatt-hours'].astype(float)
or same for any other column.
EDIT
To loop over the dataframe and change the type of all the columns, you can do the following:
Generating a dummy dataframe
df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list('abcd'))
Check the type of column in dataframe :
for column in df.columns:
    print(df[column].dtype)
Change the type of all the columns to float
for column in df.columns:
    df[column] = df[column].astype(float)
Your question is not clear: which columns are you trying to convert to float? Also, please post what you have done.
EDIT:
What you tried is right until the last line of your code, where you failed to reassign the columns:
df[columns[i]] = df[columns[i]].astype(float)
Also try using df.columns to get the column names instead of list(df.head(0)).
See the pandas docs on astype for how to cast a pandas object to a specified dtype.
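One more option (not in the answers above, just a sketch): if some cells might not parse cleanly as numbers, pd.to_numeric with errors='coerce' is more forgiving than astype(float). The dataframe here is a hypothetical stand-in for your Excel data:

```python
import pandas as pd

# hypothetical frame standing in for the spreadsheet columns;
# the real column names come from your file
df = pd.DataFrame({
    "Total Asia Pacific": ["0.428429", "0.608465"],
    "Total CIS": ["0", "7.375e-06"],
})

# pd.to_numeric parses each column; errors='coerce' turns any
# unparseable cell into NaN instead of raising an exception
df = df.apply(pd.to_numeric, errors="coerce")
print(df.dtypes)
```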
Related
Let's say I'm working on a dataset: # dummy dataset
import pandas as pd
data = pd.DataFrame({"Name_id": ["John", "Deep", "Julia", "John", "Sandy", "Deep"],
                     "Month_id": ["December", "March", "May", "April", "May", "July"],
                     "Colour_id": ["Red", "Purple", "Green", "Black", "Yellow", "Orange"]})
data
How can I convert this data frame into something like this:
Where the Name_id is unique and forms new columns based on both the value and the existence / non-existence of the other columns, in order of appearance? I have tried to use pivot, but I noticed it's more suited to numerical data than categorical.
Probably you should try pivot
data['Rowid'] = data.groupby('Name_id').cumcount()+1
d = data.pivot(index='Name_id', columns='Rowid',values = ['Month_id','Colour_id'])
d.reset_index(inplace=True)
d.columns = ['Name_id', 'Month_id1', 'Month_id2', 'Colour_id1', 'Colour_id2']
(Note the pivoted columns come out grouped by value first, i.e. both Month_id columns before both Colour_id columns.)
which gives
Name_id Month_id1 Month_id2 Colour_id1 Colour_id2
0 Deep March July Purple Orange
1 John December April Red Black
2 Julia May NaN Green NaN
3 Sandy May NaN Yellow NaN
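If you'd rather not hard-code the renamed columns (a variant of the same idea, in case your data has more than two rows per name), you can flatten the MultiIndex programmatically:

```python
import pandas as pd

data = pd.DataFrame({"Name_id": ["John", "Deep", "Julia", "John", "Sandy", "Deep"],
                     "Month_id": ["December", "March", "May", "April", "May", "July"],
                     "Colour_id": ["Red", "Purple", "Green", "Black", "Yellow", "Orange"]})

# number the repeats per name, then pivot on that counter
data["Rowid"] = data.groupby("Name_id").cumcount() + 1
d = data.pivot(index="Name_id", columns="Rowid", values=["Month_id", "Colour_id"])

# flatten the (value, Rowid) MultiIndex into names like 'Month_id1'
d.columns = [f"{value}{rowid}" for value, rowid in d.columns]
d = d.reset_index()
print(d.columns.tolist())
```

This scales to any number of repeats without listing the names by hand.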
I have a dataset that is similar in this format:
CITY - YEAR - ATTRIBUTE - VALUE
## example:
dallas-2002-crime-100
dallas-2003-crime-101
dallas-2002-population-4000
houston-2002-population-4100
etc....
I'm trying to transpose this long to wide format so that each city+year value is a row and all the distinct combinations of attributes are the columns-names.
Thus this new dataframe would look like:
###
city - year - population - crime - median_income- etc....
I've looked at the pivot function, but it doesn't seem to support a multi-index for reshaping. Can someone let me know how to work around this? Additionally, I tried to look at
pd.pivot_table, but it seems this typically only works with numerical data and aggregations like sums, means, etc. Most of my VALUE attributes are actually strings, so I don't seem to be able to use it.
### doesn't work - can't use a multiindex
df.pivot(index=['city','year'], columns = 'attribute', values='value')
Thank you for your help!
Is this what you are looking for:
import pandas as pd
from io import StringIO
data = """city-year-attribute-value
dallas-2002-crime-100
dallas-2003-crime-101
dallas-2002-population-4000
houston-2002-population-4100"""
df = pd.read_csv(StringIO(data), sep="-")
pivoted = df.pivot_table(
    index=["city", "year"],
    columns=["attribute"],
    values=["value"]
)
print(pivoted.reset_index())
Result:
city year value
attribute crime population
0 dallas 2002 100.0 4000.0
1 dallas 2003 101.0 NaN
2 houston 2002 NaN 4100.0
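Since you mentioned most of your VALUE attributes are actually strings: pivot_table's default aggfunc (mean) fails on strings, but you can pass aggfunc='first' as long as each city/year/attribute combination appears only once. A sketch with a made-up string value:

```python
import pandas as pd
from io import StringIO

data = """city-year-attribute-value
dallas-2002-crime-high
dallas-2002-population-4000
houston-2002-population-4100"""

df = pd.read_csv(StringIO(data), sep="-")

# aggfunc='first' keeps string values intact instead of averaging them
pivoted = df.pivot_table(
    index=["city", "year"],
    columns="attribute",
    values="value",
    aggfunc="first",
)
print(pivoted.reset_index())
```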
I am new to Python and am trying to play around with the Pandas Pivot Tables. I have searched and searched but none of the answers have been what I am looking for. Basically, I am trying to sort the below pandas pivot table
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "TIME": ["FQ1", "FQ2", "FQ2", "FQ2"],
    "NAME": ["Robert", "Miranda", "Robert", "Robert"],
    "TOTAL": [900, 42, 360, 2000],
    "TYPE": ["Air", "Ground", "Air", "Ground"],
    "GROUP": ["A", "A", "A", "A"]})
pt = pd.pivot_table(data=df,
                    values=["TOTAL"], aggfunc=np.sum,
                    index=["GROUP", "TYPE", "NAME"],
                    columns="TIME",
                    fill_value=0,
                    margins=True)
Basically I am hoping to sort the "Type" and the "Name" column based on the sum of each row.
The end goal in this case would be "Ground" type appearing first before "Air", and within the "Ground" type, I'm hoping to have Robert appear before Miranda, since his sum is higher.
Here is how it appears now:
TOTAL
TIME FQ1 FQ2 All
GROUP TYPE NAME
A Air Robert 900 360 1260
Ground Miranda 0 42 42
Robert 0 2000 2000
All 900 2402 3302
Thanks to anyone who is able to help!!
Try this. Because your column header is a MultiIndex, you need to use a tuple to access the column:
pt.sort_values(['GROUP', 'TYPE', ('TOTAL', 'All')],
               ascending=[True, True, False])
Output:
TOTAL
TIME FQ1 FQ2 All
GROUP TYPE NAME
A Air Robert 900 360 1260
Ground Robert 0 2000 2000
Miranda 0 42 42
All 900 2402 3302
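The sort above still puts Air before Ground because TYPE is sorted alphabetically. If you specifically want TYPE ordered by its total (Ground first, as asked), one possible workaround (my own sketch, not the original answer) is a temporary helper column ranked by each TYPE's grand total:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "TIME": ["FQ1", "FQ2", "FQ2", "FQ2"],
    "NAME": ["Robert", "Miranda", "Robert", "Robert"],
    "TOTAL": [900, 42, 360, 2000],
    "TYPE": ["Air", "Ground", "Air", "Ground"],
    "GROUP": ["A", "A", "A", "A"]})

pt = pd.pivot_table(data=df, values=["TOTAL"], aggfunc="sum",
                    index=["GROUP", "TYPE", "NAME"],
                    columns="TIME", fill_value=0, margins=True)

# rank each TYPE by its grand total (Ground=2042 outranks Air=1260)
type_rank = df.groupby("TYPE")["TOTAL"].sum().rank(ascending=False)

# helper column mapped from the TYPE index level; the margins row maps to NaN
pt[("_rank", "")] = pt.index.get_level_values("TYPE").map(type_rank)

out = (pt.sort_values(["GROUP", ("_rank", ""), ("TOTAL", "All")],
                      ascending=[True, True, False])
         .drop(columns=[("_rank", "")]))
print(out)
```

The 'All' margin row still lands last because 'All' sorts after 'A' in the GROUP level.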
I am using a pandas dataframe and I would like to remove all information after a space occurs. My dataframe is similar to this one:
import pandas as pd
d = {'Australia': pd.Series([0, '1980 (F)\n\n1957 (T)\n\n', 1991], index=['Australia', 'Belgium', 'France']),
     'Belgium': pd.Series([1980, 0, 1992], index=['Australia', 'Belgium', 'France']),
     'France': pd.Series([1991, 1992, 0], index=['Australia', 'Belgium', 'France'])}
df = pd.DataFrame(d, dtype='str')
df
I am able to remove the values for one specific column, however the split() function does not apply to the whole dataframe.
f = lambda x: x["Australia"].split(" ")[0]
df = df.apply(f, axis=1)
Does anyone have an idea how I could remove the information after a space for each value in the dataframe?
I think you need to convert all columns to strings and then apply a split function:
df = df.astype(str).apply(lambda x: x.str.split().str[0])
Another solution:
df = df.astype(str).applymap(lambda x: x.split()[0])
print (df)
Australia Belgium France
Australia 0 1980 1991
Belgium 1980 0 1992
France 1991 1992 0
Let's try using assign, since the column names in this dataframe are tame (no spaces or special characters):
df.assign(Australia=df.Australia.str.split().str[0])
Output:
Australia Belgium France
Australia 0 1980 1991
Belgium 1980 0 1992
France 1991 1992 0
Or you can use apply and a lambda function if all your column datatypes are strings:
df.apply(lambda x: x.str.split().str[0])
Or if you have a mixture of number and string dtypes, you can use select_dtypes with assign like this (this one also needs import numpy as np for np.number):
df.assign(**df.select_dtypes(exclude=np.number).apply(lambda x: x.str.split().str[0]))
You could loop over all columns and apply below:
for column in df:
    df[column] = df[column].str.split().str[0]
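One more variant (a sketch of my own, not from the answers above): a single regex replace can strip everything from the first whitespace onward across the whole frame. The inline (?s) flag makes . match newlines too, which matters for cells like '1980 (F)\n\n1957 (T)\n\n':

```python
import pandas as pd

d = {'Australia': pd.Series([0, '1980 (F)\n\n1957 (T)\n\n', 1991], index=['Australia', 'Belgium', 'France']),
     'Belgium': pd.Series([1980, 0, 1992], index=['Australia', 'Belgium', 'France']),
     'France': pd.Series([1991, 1992, 0], index=['Australia', 'Belgium', 'France'])}
df = pd.DataFrame(d, dtype='str')

# drop everything from the first whitespace character to the end of each cell
cleaned = df.astype(str).replace(r'(?s)\s.*', '', regex=True)
print(cleaned)
```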
I am trying to do some file merging with Latitude and Longitude.
Input File1.csv
Name,Lat,Lon,timeseries(n)
London,80.5234,121.0452,523
London,80.5234,121.0452,732
London,80.5234,121.0452,848
Paris,90.4414,130.0252,464
Paris,90.4414,130.0252,829
Paris,90.4414,130.0252,98
New York,110.5324,90.0023,572
New York,110.5324,90.0023,689
New York,110.5324,90.0023,794
File2.csv
Name,lat,lon,timeseries1
London,80.5234,121.0452,500
Paris,90.4414,130.0252,400
New York,110.5324,90.0023,700
Now Expected output is
File2.csv
Name,lat,lon,timeseries1,timeseries(n) #timeseries is 24 hrs format 17:45:00
London,80.5234,121.0452,500,2103 #Addition of all three values
Paris,90.4414,130.0252,400,1391
New York,110.5324,90.0023,700,2055
With plain Python, numpy and dictionaries it would be straightforward (key mapped to the sum of its values), but I want to use Pandas.
Please suggest how to start, or maybe point me to an example. I have not seen anything like dictionary types with Pandas for Latitude and Longitude.
Perform a groupby aggregation on the first df, call sum and then merge this with the other df:
In [12]:
gp = df.groupby('Name')['timeseries(n)'].sum().reset_index()
df1.merge(gp, on='Name')
Out[14]:
Name Lat Lon timeseries1 timeseries(n)
0 London 80.5234 121.0452 500 2103
1 Paris 90.4414 130.0252 400 1391
2 New York 110.5324 90.0023 700 2055
the aggregation looks like this:
In [15]:
gp
Out[15]:
Name timeseries(n)
0 London 2103
1 New York 2055
2 Paris 1391
Your csv files can be loaded using read_csv, so something like:
df = pd.read_csv('File1.csv')
df1 = pd.read_csv('File2.csv')
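Putting the pieces together as one runnable sketch (using StringIO in place of the actual csv files, so the data here mirrors your File1.csv and File2.csv):

```python
import pandas as pd
from io import StringIO

file1 = """Name,Lat,Lon,timeseries(n)
London,80.5234,121.0452,523
London,80.5234,121.0452,732
London,80.5234,121.0452,848
Paris,90.4414,130.0252,464
Paris,90.4414,130.0252,829
Paris,90.4414,130.0252,98
New York,110.5324,90.0023,572
New York,110.5324,90.0023,689
New York,110.5324,90.0023,794"""

file2 = """Name,lat,lon,timeseries1
London,80.5234,121.0452,500
Paris,90.4414,130.0252,400
New York,110.5324,90.0023,700"""

df = pd.read_csv(StringIO(file1))   # stands in for pd.read_csv('File1.csv')
df1 = pd.read_csv(StringIO(file2))  # stands in for pd.read_csv('File2.csv')

# sum the per-city timeseries values, then attach them to File2's rows
gp = df.groupby('Name')['timeseries(n)'].sum().reset_index()
result = df1.merge(gp, on='Name')
print(result)
```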