I have a df with nations as the index and years (1990-2015) as the column headers. I want to make a new df2 where every column is the sum of 5 years, e.g. 1995-1999, 2000-2004, etc.
I have done this:
df2 = pd.DataFrame(index=df.index[:], columns=['1995', '2000', '2005', '2010', '2015'])
df2['1995'] = df.iloc[0:4].sum(axis=1)
But it doesn't replace the NaN values.
What am I doing wrong? Thanks in advance
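A quick note on the attempt itself: df.iloc[0:4] selects the first four rows, not the 1995-1999 columns, so the sum runs along the wrong slice. A minimal sketch of the column-wise version, assuming the columns run 1990..2015 in order:
df2['1995'] = df.iloc[:, 5:10].sum(axis=1)  # iloc is positional: [rows, cols]; columns 5-9 are 1995-1999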
Step 1
Transpose and reset index with df.T.reset_index
df2 = df.T.reset_index(drop=True)
Step 2
Using df.groupby, group by index in sets of 5, and then sum with DataFrameGroupBy.agg, passing np.nansum
df2 = df2.groupby(df2.index // 5).agg(np.nansum).T
Step 3
Assign the columns inplace
df2.columns = pd.to_datetime(df.columns[::5]).year + 5
df = ... # Borrowed from Bharath
df2 = df.T.reset_index(drop=True)
df2 = df2.groupby(df2.index // 5).sum().T
df2.columns = pd.to_datetime(df.columns[::5]).year + 5
print(df2)
Output:
1995 2000 2005 2010
Country
IN 72 29 100 2
EG 31 40 40 24
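The transpose round-trip can also be skipped by grouping directly along the columns axis. A minimal sketch, assuming a pandas version where groupby(axis=1) is still available (it was deprecated in pandas 2.x):
import numpy as np
# group the columns in blocks of 5 by position and sum each block
df2 = df.groupby(np.arange(df.shape[1]) // 5, axis=1).sum()
df2.columns = pd.to_datetime(df.columns[::5]).year + 5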
I think you are looking for the sum of every 5 columns after a specific column. One way of doing it is using a for loop to concatenate data after slicing, i.e. if you have a dataframe
df = pd.DataFrame({'Country':['IN','EG'],'1990':[2,4],'1991':[4,5],'1992':[2,4],'1993':[2,4],'1994':[62,14],'1995':[21,4],'1996':[2,14],'1997':[2,4],'1998':[2,14],'1999':[2,4],'2000':[2,4],'2001':[2,14],'2002':[92,4],'2003':[2,4],'2004':[2,14],'2005':[2,24]})
df.set_index('Country',drop=True,inplace=True)
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 \
Country
IN 2 4 2 2 62 21 2 2 2 2 2
EG 4 5 4 4 14 4 14 4 14 4 4
2001 2002 2003 2004 2005
Country
IN 2 92 2 2 2
EG 14 4 4 14 24
Then
df2 = pd.DataFrame(index=df.index[:])
columns = ['1990', '1995', '2000', '2005']
for x in columns:
    df2 = pd.concat([df2, df[df.columns[df.columns.tolist().index(x):][0:5]].sum(axis=1)], axis=1)
df2.columns = columns
Output :
1990 1995 2000 2005
Country
IN 72 29 100 2
EG 31 40 40 24
If you want different column labels, then:
df2.columns = ['1990-1994','1995-1999','2000-2004','2005-']
Hope it helps
You can use:
convert the columns with to_datetime
resample along the columns (axis=1) by 5A (years) and aggregate sum
last, get the years from the columns with DatetimeIndex.year and subtract 4
df.columns = pd.to_datetime(df.columns, format='%Y')
df2 = df.resample('5A',axis=1, closed='left').sum()
df2.columns = df2.columns.year - 4
print (df2)
1990 1995 2000 2005
Country
IN 72 29 100 2
EG 31 40 40 24
If you need different year labels, it is also possible to add 1 instead:
df.columns = pd.to_datetime(df.columns, format='%Y')
df2 = df.resample('5A',axis=1, closed='left').sum()
df2.columns = df2.columns.year + 1
print (df2)
1995 2000 2005 2010
Country
IN 72 29 100 2
EG 31 40 40 24
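Note that axis=1 in resample was deprecated in later pandas releases; the same result can be obtained by transposing first. A sketch, assuming the columns were already converted with to_datetime as above:
# resample along the DatetimeIndex after transposing, then transpose back
df2 = df.T.resample('5A', closed='left').sum().T
df2.columns = df2.columns.year - 4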
I have a dataset like this
id event 2015 2016 2017
a 2015 33 na na
a 2016 na 32 na
a 2017 na na 31
b 2015 30 na na
b 2017 na na 20
How do I put all the non-missing values for each id into the same row:
id 2015 2016 2017
a 33 32 31
b 30 0 20
Sorry, the questions above do not solve my case, and the code does not work.
try:
df = df.set_index('event').replace('na', np.nan)
df1 = pd.concat([df[col].dropna() for col in df.columns], axis=0).to_frame().T
df1:
event 2015 2016 2017
0 33 32 31
First replace 'na' with NaN, then set event as the index. Drop all the NaN from the column values.
Use GroupBy.first for the first non-missing value per group by id:
df = (df.drop('event', axis=1)
.replace('na', np.nan)
.groupby('id', as_index=False)
.first()
.fillna(0))
print (df)
id 2015 2016 2017
0 a 33 32 31
1 b 30 0 20
There are two files. If the ID number matches in both files, then I want only Value 1 and Value 2 from File2.txt. Please let me know if my question is unclear.
File1.txt
ID Number Value 1 Value 2 Country
0001 23 55 Spain
0231 15 23 USA
4213 10 11 Canada
7541 32 29 Italy
File2.txt
0001 5 6
0231 7 18
4213 54 87
5554 12 10
1111 31 13
6422 66 51
The output should look like this.
ID Number Value 1 Value 2 Country
0001 5 6 Spain
0231 7 18 USA
4213 54 87 Canada
7541 32 29 Italy
New example:
File3.txt
#ID CAT CHN LC SC LATITUDE LONGITUDE
20022 CX 21 -- 4 32.739000 -114.635700
01711 CX 21 -- 3 32.779700 -115.567500
08433 CX 21 -- 2 31.919930 -123.321000
File4.txt
20022,32.45,-114.88
01192,32.839,-115.487
01711,32.88,-115.45
01218,32.717,-115.637
output
#ID CAT CHN LC SC LATITUDE LONGITUDE
20022 CX 21 -- 4 32.45 -114.88
01711 CX 21 -- 3 32.88 -115.45
08433 CX 21 -- 2 31.919930 -123.321000
Code I have so far:
import pandas as pd

f = open("File3.txt", "r")
x = open("File4.txt", "r")
# File3 is whitespace-delimited; File4 is comma-delimited.
# Read the IDs as strings so leading zeros (e.g. 01711) survive and the indexes align.
df1 = pd.read_csv(f, sep=r'\s+', engine='python', dtype={'#ID': str})
df2 = pd.read_csv(x, sep=',', header=None, engine='python', dtype={0: str})
df2 = df2.set_index(0).rename_axis("#ID")
df2 = df2.rename(columns={1: 'LATITUDE', 2: 'LONGITUDE'})
df1 = df1.set_index('#ID')
df1.update(df2)
print(df1)
Something like this, possibly:
file1_data = []
file1_headers = []
with open("File1.txt") as file1:
    for line in file1:
        file1_data.append(line.strip().split("\t"))
file1_headers = file1_data[0]
del file1_data[0]
file2_data = []
with open("File2.txt") as file2:
    for line in file2:
        file2_data.append(line.strip().split("\t"))
file2_ids = [x[0] for x in file2_data]
final_data = [file1_headers] + file1_data
for i in range(1, len(final_data)):
    if final_data[i][0] in file2_ids:
        match = [x for x in file2_data if x[0] == final_data[i][0]]
        # take File2's ID and values, keep File1's country (index 3)
        final_data[i] = match[0] + [final_data[i][3]]
with open("output.txt", "w") as output:
    output.writelines(["\t".join(x) + "\n" for x in final_data])
final_data starts as the header row plus file1_data's rows; the loop then selectively replaces rows with matching IDs from file2_data, while keeping the country field.
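As an aside, the inner list comprehension rescans file2_data for every row; a dict keyed by ID makes the lookup constant-time. A minimal variant using the same variable names as above:
# build an ID -> row mapping once instead of scanning file2_data per row
file2_map = {row[0]: row for row in file2_data}
for i in range(1, len(final_data)):
    row_id = final_data[i][0]
    if row_id in file2_map:
        final_data[i] = file2_map[row_id] + [final_data[i][3]]  # keep the country field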
Okay, what you need to do here is to get the indexes to match in both dataframes after importing. This is important because pandas uses data alignment based on indexes.
Here is a complete example using your data:
from io import StringIO
import pandas as pd
File1txt=StringIO("""ID Number Value 1 Value 2 Country
0001 23 55 Spain
0231 15 23 USA
4213 10 11 Canada
7541 32 29 Italy""")
File2txt = StringIO("""0001 5 6
0231 7 18
4213 54 87
5554 12 10
1111 31 13
6422 66 51""")
df1 = pd.read_csv(File1txt, sep=r'\s\s+', engine='python')
df2 = pd.read_csv(File2txt, sep=r'\s\s+', header=None, engine='python')
print(df1)
# ID Number Value 1 Value 2 Country
# 0 1 23 55 Spain
# 1 231 15 23 USA
# 2 4213 10 11 Canada
# 3 7541 32 29 Italy
print(df2)
# 0 1 2
# 0 1 5 6
# 1 231 7 18
# 2 4213 54 87
# 3 5554 12 10
# 4 1111 31 13
# 5 6422 66 51
df2 = df2.set_index(0).rename_axis('ID Number')
df2 = df2.rename(columns={1:'Value 1', 2: 'Value 2'})
df1 = df1.set_index('ID Number')
df1.update(df2)
print(df1.reset_index())
Output:
ID Number Value 1 Value 2 Country
0 1 5.0 6.0 Spain
1 231 7.0 18.0 USA
2 4213 54.0 87.0 Canada
3 7541 32.0 29.0 Italy
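One caveat: the IDs are parsed as integers, so leading zeros are lost ('0001' becomes 1, as the output above shows). If they matter, reading the ID columns as strings keeps them; a sketch, assuming freshly created StringIO inputs as above:
# dtype=str keeps '0001' intact instead of parsing it to the integer 1
df1 = pd.read_csv(File1txt, sep=r'\s\s+', engine='python', dtype={'ID Number': str})
df2 = pd.read_csv(File2txt, sep=r'\s\s+', header=None, engine='python', dtype={0: str})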
My database from Excel has some information by country across years. The problem is that each year is a different column header. For example:
Country Indicator 1950 1951 1952
Australia x 10 27 20
Australia y 7 11 8
Australia z 40 32 37
I want to convert each Indicator into a column header and put the years into a single column, like this:
Country year x y z
Australia 1950 10 7 40
Australia 1951 27 11 32
Australia 1952 20 8 37
And I don't know how many countries are in the column; the years run from 1950 to 2019.
We can do this reshape with stack and unstack:
df.set_index(['Country','Indicator']).stack().unstack(level=1).reset_index()
Indicator Country level_1 x y z
0 Australia 1950 10 7 40
1 Australia 1951 27 11 32
2 Australia 1952 20 8 37
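The leftover level_1 label can be tidied up by renaming it and clearing the columns' axis name; a small follow-up sketch:
out = (df.set_index(['Country', 'Indicator'])
         .stack()
         .unstack(level=1)
         .reset_index()
         .rename(columns={'level_1': 'year'})   # give the year column a real name
         .rename_axis(None, axis=1))            # drop the 'Indicator' axis label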
This is just an exploration; @Yoben's solution is the proper way to do it via pandas. I am just seeing what other possibilities there are:
#create a dictionary of the years
years = {'Year': df.filter(regex=r'\d').columns}
#get the data for the year columns
year_data = df.filter(regex=r'\d').to_numpy()
#create a dictionary from the indicator and years columns pairing
reshaped = dict(zip(df.Indicator,year_data))
reshaped.update(years)
#create a new dataframe
pd.DataFrame(reshaped,index=df.Country)
x y z Year
Country
Australia 10 7 40 1950
Australia 27 11 32 1951
Australia 20 8 37 1952
You should never have to do this, as you could easily work within the dataframe without the need to create a new one. The only time you may consider this is for speed. Besides that, it is just something to explore.
It's not exactly what you are looking for, but if your dataframe is the variable df, you can use the transpose method to invert the dataframe.
In [7]: df
Out[7]:
col1 col2 col3
0 1 True 10
1 2 False 10
2 3 False 100
3 4 True 100
Transpose
In [8]: df.T
Out[8]:
0 1 2 3
col1 1 2 3 4
col2 True False False True
col3 10 10 100 100
I think you have a multi-index dataframe so you may want to check the documentation on that.
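For completeness, the same reshape is often written with melt plus pivot_table; a minimal sketch on the Country/Indicator frame from above:
# melt to long form, then spread the indicators back out as columns
out = (df.melt(id_vars=['Country', 'Indicator'], var_name='year')
         .pivot_table(index=['Country', 'year'], columns='Indicator', values='value')
         .reset_index())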
I am trying to add an empty column as the third column of my data frame, which contains 5 columns. Example:
Fname,Lname,city,state,zip
mike,smith,new york,ny,11101
This is what I have and below I am going to show what I want it to look like.
Fname,Lname,new column,city,state,zip
mike,smith,,new york,ny,11101
I don't want to populate that column with data; all I want to do is add the extra column to the header, and each data row will just have a blank field, i.e. ',,'.
I've seen examples where a new column is added to the end of a data frame, but not at a specific position.
You should use
df.insert(loc, column, value)
with loc being the insertion index, column the column name, and value its value.
For an empty column:
df.insert(loc=2, column='new col', value=['' for i in range(df.shape[0])])
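A scalar also broadcasts to every row, so the list comprehension is not strictly needed:
df.insert(loc=2, column='new col', value='')  # '' is repeated for every row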
Use reindex or column filtering
df = pd.DataFrame(np.arange(50).reshape(10,-1), columns=[*'ABCDE'])
df['z']= np.nan
df[['A','z','B','C','D','E']]
OR
df.reindex(['A','z','B','C','D','E'], axis=1)
Output:
A z B C D E
0 0 NaN 1 2 3 4
1 5 NaN 6 7 8 9
2 10 NaN 11 12 13 14
3 15 NaN 16 17 18 19
4 20 NaN 21 22 23 24
5 25 NaN 26 27 28 29
6 30 NaN 31 32 33 34
7 35 NaN 36 37 38 39
8 40 NaN 41 42 43 44
9 45 NaN 46 47 48 49
You can simply go for df.insert()
import pandas as pd
data = {'Fname': ['mike'],
'Lname': ['smith'],
'city': ['new york'],
'state': ['ny'],
'zip': [11101]}
df = pd.DataFrame(data)
df.insert(1, "Address", '', True)
print(df)
Output:
Fname Address Lname city state zip
0 mike smith new york ny 11101
I have a pandas data frame df like this.
In [1]: df
Out[1]:
country count
0 Japan 78
1 Japan 80
2 USA 45
3 France 34
4 France 90
5 UK 45
6 UK 34
7 China 32
8 China 87
9 Russia 20
10 Russia 67
I want to remove rows with the maximum value in each group. So the result should look like:
country count
0 Japan 78
3 France 34
6 UK 34
7 China 32
9 Russia 20
My first attempt:
idx = df.groupby(['country'], sort=False).max()['count'].index
df_new = df.drop(list(idx))
My second attempt:
idx = df.groupby(['country'])['count'].transform(max).index
df_new = df.drop(list(idx))
But it didn't work. Any ideas?
groupby / transform('max')
You can first calculate a series of maximums by group, then filter out rows where count equals that series. Note this will also remove duplicate maximums.
g = df.groupby(['country'])['count'].transform('max')
df = df[~(df['count'] == g)]
The series g gives each row the maximum of its group. Where this equals df['count'] (aligned by index), the row holds its group's maximum; ~ negates the condition, so those rows are filtered out.
print(df.groupby(['country'])['count'].transform('max'))
0 80
1 80
2 45
3 90
4 90
5 45
6 45
7 87
8 87
9 20
Name: count, dtype: int64
sort + drop
Alternatively, you can sort by count and drop the last row of each group:
res = df.sort_values('count')
res = res.drop(res.groupby('country').tail(1).index)
print(res)
country count
9 Russia 20
7 China 32
3 France 34
6 UK 34
0 Japan 78
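If the maximum is unique within each group, idxmax offers a third route: drop exactly one row per group by its index label. A minimal sketch:
# idxmax returns the index label of the (first) maximum in each group
res = df.drop(df.groupby('country')['count'].idxmax())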