output from summing two dataframes - python

I have two dataframes, df1 (1 row, 10 columns) and df2 (7 rows, 10 columns). Now I want to add values from those two dataframes:
final = df1[0] + df2[0][0]
print final
output:
0 23.8458
Name: 0, dtype: float64
I believe 23.8458 is what I want, but I don't understand what the "0" stands for, or what the datatype of "final" is. I just want to keep 23.8458 as a float. How can I do that? Thanks,

What you are printing is a Series with a single entry: the index label is 0 and the value is 23.8458. If you want just the value, then print final[0] will give you what you want.
You can see the type if you do print type(final)
Example:
In [42]:
df1 = pd.DataFrame(np.random.randn(1,7))
print(df1)
df2 = pd.DataFrame(np.random.randn(7,10))
print(df2)
0 1 2 3 4 5 6
0 -1.575662 0.725688 -0.442072 0.622824 0.345227 -0.062732 0.197771
0 1 2 3 4 5 6 \
0 -0.658687 0.404817 -0.786715 -0.923070 0.479590 0.598325 -0.495963
1 1.482873 -0.240854 -0.987909 0.085952 -0.054789 -0.576887 -0.940805
2 1.173126 0.190489 -0.809694 -1.867470 0.500751 -0.663346 -1.777718
3 -0.111570 -0.606001 -0.755202 -0.201915 0.933065 -0.833538 2.526979
4 1.537778 -0.739090 -1.813050 0.601448 0.296994 -0.966876 0.459992
5 -0.936997 -0.494562 0.365359 -1.351915 -0.794753 1.552997 -0.684342
6 0.128406 0.016412 0.461390 -2.411903 3.154070 -0.584126 0.136874
7 8 9
0 -0.483917 -0.268557 1.386847
1 0.379854 0.205791 -0.527887
2 -0.307892 -0.915033 0.017231
3 -0.672195 0.110869 1.779655
4 0.241685 1.899335 -0.334976
5 -0.510972 -0.733894 0.615160
6 2.094503 -0.184050 1.328208
In [43]:
final = df1[0] + df2[0][0]
print(final)
0 -2.234349
Name: 0, dtype: float64
In [44]:
print(type(final))
<class 'pandas.core.series.Series'>
In [45]:
final[0]
Out[45]:
-2.2343491912631328
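If you just want the number as a plain float, positional access also works and avoids relying on the index label 0 (a sketch assuming the one-element final Series built above):
value = final.iloc[0]     # positional lookup, equivalent to final[0] here
print(float(value))       # e.g. 23.8458 as a plain Python float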

Related

How to apply .astype() method to a dataframe in Python?

I want to convert multiple columns in a dataframe (pandas) to the type "category" using the method .astype. Here is my code:
df['Field_1'].astype('category').cat.codes
works, however
categories = df.select_dtypes('object')
categories['Field_1'].cat.codes
doesn't.
Would someone please tell me why?
In general, the question is how to apply a method (.astype) to a dataframe. I know how to apply a method to a column of a dataframe; however, applying it to the whole dataframe hasn't been successful, even with a for loop, since the for loop returns a series and the method .cat.codes is not applicable to that series.
I think you need to process each column separately with DataFrame.apply and a lambda function; your code failed because Series.cat.codes is not implemented for DataFrame:
df = pd.DataFrame({
    'A': list('acbdac'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': list('dddbbb')
})
cols = df.select_dtypes('object').columns
df[cols] = df[cols].apply(lambda x: x.astype('category').cat.codes)
print (df)
A B C D
0 0 4 7 1
1 2 5 8 1
2 1 4 9 1
3 3 5 4 0
4 0 5 2 0
5 2 4 3 0
A similar idea, though I'm not sure the output is the same, is to convert all columns to categorical in the first step with DataFrame.astype:
cols = df.select_dtypes('object').columns
df[cols] = df[cols].astype('category').apply(lambda x: x.cat.codes)
print (df)
A B C D
0 0 4 7 1
1 2 5 8 1
2 1 4 9 1
3 3 5 4 0
4 0 5 2 0
5 2 4 3 0
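To check the "not sure if same output" point above, the two results can be compared directly; a minimal sketch assuming the same sample df:
import pandas as pd

df = pd.DataFrame({
    'A': list('acbdac'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': list('dddbbb')
})
cols = df.select_dtypes('object').columns

res1 = df.copy()
res1[cols] = res1[cols].apply(lambda x: x.astype('category').cat.codes)

res2 = df.copy()
res2[cols] = res2[cols].astype('category').apply(lambda x: x.cat.codes)

# raises AssertionError if the two approaches disagree
pd.testing.assert_frame_equal(res1, res2)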

How to subtract a row value from the nth rows' values of another dataframe

I have this two df's
df1:
lon lat
0 -60.7 -2.8333333333333335
1 -55.983333333333334 -2.4833333333333334
2 -51.06666666666667 -0.05
3 -66.96666666666667 -0.11666666666666667
4 -48.483333333333334 -1.3833333333333333
5 -54.71666666666667 -2.4333333333333336
6 -44.233333333333334 -2.6
7 -59.983333333333334 -3.15
df2:
lon lat
0 -24.109 -2.0035
1 -17.891 -1.70911
2 -14.5822 -1.7470700000000001
3 -12.8138 -1.72322
4 -14.0688 -1.5028700000000002
5 -13.8406 -1.44416
6 -12.1292 -0.671266
7 -13.8406 -0.8824270000000001
8 -15.12 -18.223
I want to subtract every value of df2['lat'] from each value of df1['lat'].
Something like this:
results0=df1.loc[0,'lat']-df2.loc[:,'lat']
results1=df1.loc[1,'lat']-df2.loc[:,'lat']
#etc etc....
So i tried this:
for i, j in zip(range(len(df1)), range(len(df2))):
    exec(f"result{i} = df1.loc[{i},'lat'] - df2.loc[{j},'lat']")
But it only gave me one result value for each result, instead of 8 values for each result.
I will appreciate any possible solution. Thanks!
You can create a list of Series:
L = [df1.loc[i,'lat']-df2['lat'] for i in df1.index]
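Each element of L is one full results Series from the question, for example (assuming the df1 and df2 shown above):
print (L[0])   # same as results0 = df1.loc[0,'lat'] - df2.loc[:,'lat']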
Or you can use numpy to build a new DataFrame:
arr = df1['lat'].to_numpy() - df2['lat'].to_numpy()[:, None]
df3 = pd.DataFrame(arr, index=df2.index, columns=df1.index)
print (df3)
0 1 2 3 4 5 \
0 -0.829833 -0.479833 1.953500 1.886833 0.620167 -0.429833
1 -1.124223 -0.774223 1.659110 1.592443 0.325777 -0.724223
2 -1.086263 -0.736263 1.697070 1.630403 0.363737 -0.686263
3 -1.110113 -0.760113 1.673220 1.606553 0.339887 -0.710113
4 -1.330463 -0.980463 1.452870 1.386203 0.119537 -0.930463
5 -1.389173 -1.039173 1.394160 1.327493 0.060827 -0.989173
6 -2.162067 -1.812067 0.621266 0.554599 -0.712067 -1.762067
7 -1.950906 -1.600906 0.832427 0.765760 -0.500906 -1.550906
8 15.389667 15.739667 18.173000 18.106333 16.839667 15.789667
6 7
0 -0.596500 -1.146500
1 -0.890890 -1.440890
2 -0.852930 -1.402930
3 -0.876780 -1.426780
4 -1.097130 -1.647130
5 -1.155840 -1.705840
6 -1.928734 -2.478734
7 -1.717573 -2.267573
8 15.623000 15.073000
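Each column of df3 is one of the wanted results (column 0 is results0, column 1 is results1, and so on), so a single result can be pulled out by column label; a small sketch assuming the df3 built above:
results0 = df3[0]   # Series over df2.index, same values as df1.loc[0,'lat'] - df2['lat']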
Since df1 has one row fewer than df2, you can also subtract element-wise over the matching positions:
df1['lat'] = df1['lat'] - df2.loc[:df1.shape[0]-1, 'lat']
output:
0 -0.829833
1 -0.774223
2 1.697070
3 1.606553
4 0.119537
5 -0.989173
6 -1.928734
7 -2.267573
Name: lat, dtype: float64

drop group by number of occurrence

Hi, I want to delete the rows whose entries occur fewer times than a given number. For example:
df = pd.DataFrame({'a': [1,2,3,2], 'b':[4,5,6,7], 'c':[0,1,3,2]})
df
a b c
0 1 4 0
1 2 5 1
2 3 6 3
3 2 7 2
Here I want to delete all rows whose value in column 'a' occurs fewer than two times.
Wanted output:
a b c
1 2 5 1
3 2 7 2
What I know:
We can find the number of occurrences with condition = df['a'].value_counts() < 2, and it will give me something like:
2 False
3 True
1 True
Name: a, dtype: int64
But I don't know how I should approach from here to delete the rows.
Thanks in advance!
groupby + size
res = df[df.groupby('a')['b'].transform('size') >= 2]
The transform method maps df.groupby('a')['b'].size() to df aligned with df['a'].
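For reference, the intermediate Series produced by the transform looks like this (a quick sketch using the df from the question):
print(df.groupby('a')['b'].transform('size'))
# 0    1
# 1    2
# 2    1
# 3    2   -> rows 1 and 3 pass the >= 2 test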
value_counts + map
s = df['a'].value_counts()
res = df[df['a'].map(s) >= 2]
print(res)
a b c
1 2 5 1
3 2 7 2
You can use df.where together with dropna; map the counts back onto column 'a' so the condition is aligned with the rows, and compare with >= 2:
df.where(df['a'].map(df['a'].value_counts()) >= 2).dropna()
a b c
1 2.0 5.0 1.0
3 2.0 7.0 2.0
You could try something like this: get the length of each group, transform it back onto the original index, and index the df by it:
df[df.groupby("a").transform(len)["b"] >= 2]
a b c
1 2 5 1
3 2 7 2
Breaking it into individual steps you get:
df.groupby("a").transform(len)["b"]
0 1
1 2
2 1
3 2
Name: b, dtype: int64
These are the group sizes transformed back onto your original index
df.groupby("a").transform(len)["b"] >=2
0 False
1 True
2 False
3 True
Name: b, dtype: bool
We then turn this into the boolean index and index our original dataframe by it
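Not used in the answers above, but the same filtering can also be written with GroupBy.filter, which keeps only the groups satisfying a condition:
res = df.groupby('a').filter(lambda g: len(g) >= 2)
print(res)
# keeps rows 1 and 3, i.e. the rows whose 'a' value occurs at least twice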

Delete a row if it doesn't contain a specified integer value (Pandas)

I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it is possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working because some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. That is why I would like to remove anything that is not an integer with one of those values.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with two isin conditions combined with &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for invalid values, and then filter with notnull:
df = pd.DataFrame({'a': ['1abc', '2', '3'],
                   'b': ['4', '5', 'dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = pd.to_numeric(df['a'], errors='coerce').notnull() & \
       pd.to_numeric(df['b'], errors='coerce').notnull()
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a': ['1abc', None, '3'],
                   'b': ['4', '5', np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7
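Putting the two ideas above together for the original problem (values like 1abc mixed in, plus the membership check), one possible sketch with made-up sample data:
import pandas as pd

df = pd.DataFrame({'a': ['1abc', '1', '4', '3'],
                   'b': ['6', '6', '7', '7']})

# coerce bad strings to NaN, then keep only rows whose values are in the allowed sets
a = pd.to_numeric(df['a'], errors='coerce')
b = pd.to_numeric(df['b'], errors='coerce')
print(df[a.isin([1, 3]) & b.isin([6, 7])])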

Pandas column.sum() without having the index values multiply

I have a DataFrame where, when I take the .sum() of the columns, Pandas is multiplying each row entry by the index value.
I need just a raw count at the end of each column, not a "sum" per se. What is the best way?
To find the sum of the values, use .sum(). To find a count of the non-empty cells, use .count(). To find a count of the cells with a value greater than 0, try df[df>0].count().
In [29]: df=pd.read_table('data.csv', delim_whitespace=True)
In [30]: df
Out[30]:
BPC B-S
0 2 1
1 5 2
2 0 1
3 0 0
4 0 0
5 2 1
6 8 3
7 38 12
[8 rows x 2 columns]
In [31]: df.sum()
Out[31]:
BPC 55
B-S 20
dtype: int64
In [32]: df[df>0].count()
Out[32]:
BPC 5
B-S 6
dtype: int64
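If the concern is that the index labels get folded into the total, a quick sketch with a deliberately odd index shows that .sum() only adds the column values:
import pandas as pd

df = pd.DataFrame({'BPC': [2, 5, 0], 'B-S': [1, 2, 1]}, index=[10, 20, 30])
print(df.sum())   # BPC 7, B-S 4 -- the index labels 10/20/30 are ignored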
