I have a pandas DataFrame like this:
When I take the .sum() of the columns, Pandas is multiplying each row entry by the index value.
I need just a raw count at the end of each column, not a "sum" per se. What is the best way?
To find the sum of the values, use .sum(). To find a count of the non-empty cells, use .count(). To find a count of the cells with a value greater than 0, try df[df>0].count().
In [29]: df=pd.read_table('data.csv', delim_whitespace=True)
In [30]: df
Out[30]:
   BPC  B-S
0    2    1
1    5    2
2    0    1
3    0    0
4    0    0
5    2    1
6    8    3
7   38   12
[8 rows x 2 columns]
In [31]: df.sum()
Out[31]:
BPC 55
B-S 20
dtype: int64
In [32]: df[df>0].count()
Out[32]:
BPC 5
B-S 6
dtype: int64
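As a side note, booleans sum as ones and zeros, so an equivalent way to count the positive cells is:
(df > 0).sum()   # BPC 5, B-S 6 -- same result as df[df>0].count()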
Related question:
how to find the most frequent value of each row of a dataframe?
For example:
In [14]: df
Out[14]:
a b c
0 2 3 3
1 1 1 2
2 7 7 8
return:
[3,1,7]
Try the .mode() method:
In [88]: df
Out[88]:
a b c
0 2 3 3
1 1 1 2
2 7 7 8
In [89]: df.mode(axis=1)
Out[89]:
0
0 3
1 1
2 7
From docs:
Gets the mode(s) of each element along the axis selected. Adds a row
for each mode per label, fills in gaps with nan.
Note that there could be multiple values returned for the selected
axis (when more than one item share the maximum frequency), which is
the reason why a dataframe is returned. If you want to impute missing
values with the mode in a dataframe df, you can just do this:
df.fillna(df.mode().iloc[0])
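A minimal sketch of that imputation, using a made-up frame just to illustrate:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [np.nan, 2, 2]})
# df.mode() can have several rows when there are ties; .iloc[0] keeps the first
# mode per column, and fillna then fills each column with its own mode
filled = df.fillna(df.mode().iloc[0])
# the NaN in 'a' becomes 1.0, the NaN in 'b' becomes 2.0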
I'm using Pandas to come up with a new column that looks through an entire column of values [1-100] and counts how many values are less than the current row's value.
See the example df below:
A  NewCol
1  0
3  2
2  1
5  4
8  5
3  2
Essentially, for each row I need to look at the entire Column A, and count how many values are less than the current row. So for Value 5, there are 4 values that are less (<) than 5 (1,2,3,3).
What would be the easiest way of doing this?
Thanks!
One way to do this is to use rank with method='min':
df['NewCol'] = (df['A'].rank(method='min') - 1).astype(int)
Output:
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
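Why this works: with method='min', tied values all receive the lowest rank in their group (both 3s get rank 3 here), so rank minus 1 is exactly the number of strictly smaller values.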
I am using numpy broadcasting:
s=df.A.values
(s[:,None]>s).sum(1)
Out[649]: array([0, 2, 1, 4, 5, 2])
#df['NewCol']=(s[:,None]>s).sum(1)
Timing:
df=pd.concat([df]*1000)
%%timeit
s=df.A.values
(s[:,None]>s).sum(1)
10 loops, best of 3: 83.7 ms per loop
%timeit (df['A'].rank(method='min') - 1).astype(int)
1000 loops, best of 3: 479 µs per loop
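The gap makes sense: the broadcast builds an n x n comparison matrix before summing, so time and memory grow quadratically, while rank stays roughly n log n.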
Try this code:
A = [Your numbers]
less_than = []
# for each element, count how many entries in A are strictly smaller
for element in A:
    counter = 0
    for number in A:
        if number < element:
            counter += 1
    less_than.append(counter)
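You could then assign the result back onto the frame (assuming A was taken from df['A'].tolist()):
df['NewCol'] = less_than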
You can do it this way:
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 2, 5, 8, 3]})
df['NewCol'] = 0
for idx, row in df.iterrows():
    # count how many values in column A are smaller than this row's value
    df.loc[idx, 'NewCol'] = (df.loc[:, 'A'] < row.A).sum()
print(df)
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
Another way is to sort and reset the index:
m=df.A.sort_values().reset_index(drop=True).reset_index()
m.columns=['new','A']
print(m)
new A
0 0 1
1 1 2
2 2 3
3 3 3
4 4 5
5 5 8
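That only shows the sorted positions, though; to get the counts back in the original row order you could, as a sketch, take the lowest position per value and map it (the minimum handles duplicates such as the two 3s):
counts = m.groupby('A')['new'].min()   # lowest sorted position of each value
df['NewCol'] = df['A'].map(counts)     # back onto the original row order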
You didn't specify whether speed or memory usage is important (or whether you have a very large dataset). The "easiest" way to do it is straightforward: calculate how many values are less than i for each entry in the column and collect those into a new column:
df=pd.DataFrame({'A': [1,3,2,5,8,3]})
col=df['A']
df['new_col']=[ sum(col<i) for i in col ]
print(df)
Result:
A new_col
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
There might be more efficient ways to do this on large datasets, such as sorting your column first.
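As one such sketch, numpy's searchsorted on a sorted copy gives, for each value, the number of strictly smaller entries:
import numpy as np

a = df['A'].values
df['new_col'] = np.searchsorted(np.sort(a), a, side='left')   # O(n log n) rather than O(n^2)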
I have two dataframes, df1 (1 row, 10 columns) and df2 (7 rows, 10 columns). Now I want to add values from those two dataframes:
final = df1[0] + df2[0][0]
print final
output:
0 23.8458
Name: 0, dtype: float64
I believe 23.8458 is what I want, but I don't understand what the "0" stands for, or what the datatype of "final" is. I just want to keep 23.8458 as a float number. How can I do that? Thanks.
What you are printing is a Series with a single entry: index label 0 and value 23.8458. If you want just the value, then print final[0] would give you what you want.
You can see the type if you do print type(final).
Example:
In [42]:
df1 = pd.DataFrame(np.random.randn(1,7))
print(df1)
df2 = pd.DataFrame(np.random.randn(7,10))
print(df2)
0 1 2 3 4 5 6
0 -1.575662 0.725688 -0.442072 0.622824 0.345227 -0.062732 0.197771
0 1 2 3 4 5 6 \
0 -0.658687 0.404817 -0.786715 -0.923070 0.479590 0.598325 -0.495963
1 1.482873 -0.240854 -0.987909 0.085952 -0.054789 -0.576887 -0.940805
2 1.173126 0.190489 -0.809694 -1.867470 0.500751 -0.663346 -1.777718
3 -0.111570 -0.606001 -0.755202 -0.201915 0.933065 -0.833538 2.526979
4 1.537778 -0.739090 -1.813050 0.601448 0.296994 -0.966876 0.459992
5 -0.936997 -0.494562 0.365359 -1.351915 -0.794753 1.552997 -0.684342
6 0.128406 0.016412 0.461390 -2.411903 3.154070 -0.584126 0.136874
7 8 9
0 -0.483917 -0.268557 1.386847
1 0.379854 0.205791 -0.527887
2 -0.307892 -0.915033 0.017231
3 -0.672195 0.110869 1.779655
4 0.241685 1.899335 -0.334976
5 -0.510972 -0.733894 0.615160
6 2.094503 -0.184050 1.328208
In [43]:
final = df1[0] + df2[0][0]
print(final)
0 -2.234349
Name: 0, dtype: float64
In [44]:
print(type(final))
<class 'pandas.core.series.Series'>
In [45]:
final[0]
Out[45]:
-2.2343491912631328
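As a side note, if you want a plain Python float rather than a Series at all, float(final[0]) or final.item() (on a single-element Series) would also do it.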
I'm reading in two csv files, selecting data from a specific column, dropping NA/nulls, and then using the data that fits some condition in one file to print the associated data in another:
data1 = pandas.read_csv(filename1, usecols = ['X', 'Y', 'Z']).dropna()
data2 = pandas.read_csv(filename2, usecols = ['X', 'Y', 'Z']).dropna()
i = 0
for item in data1['Y']:
    if item > -20:
        print data2['X'][i]
    i += 1
But this throws me an error:
File "hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7035)
File "hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6976)
KeyError: 6L
It turns out that when I print data2['X'], I see missing numbers in the row index:
0 -1.953779
1 -2.010039
2 -2.562191
3 -2.723993
4 -2.302720
5 -2.356181
7 -1.928778
...
How do I fix this and renumber the index values? Or is there a better way?
Found a solution in another question from here: Reindexing dataframes
.reset_index(drop=True) does the trick!
0 -1.953779
1 -2.010039
2 -2.562191
3 -2.723993
4 -2.302720
5 -2.356181
6 -1.928778
7 -1.925359
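Applied to the original code, that would look something like:
data2 = pandas.read_csv(filename2, usecols=['X', 'Y', 'Z']).dropna().reset_index(drop=True)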
Are your two files/dataframes the same length? If so, you can leverage boolean masks and do this (and it avoids a for loop):
data2['X'][data1['Y'] > -20]
Edit: in response to the comment
What happens in between:
In [16]: df1
Out[16]:
X Y
0 0 0
1 1 2
2 2 4
3 3 6
4 4 8
In [17]: df2
Out[17]:
Y X
0 64 75
1 65 73
2 36 44
3 13 58
4 92 54
# creates a pandas Series object of True/False, which you can then use as a "mask"
In [18]: df2['Y'] > 50
Out[18]:
0 True
1 True
2 False
3 False
4 True
Name: Y, dtype: bool
# mask is applied element-wise to (in this case) the column of your DataFrame you want to filter
In [19]: df1['X'][ df2['Y'] > 50 ]
Out[19]:
0 0
1 1
4 4
Name: X, dtype: int64
# same as doing this (where the mask is applied to the whole dataframe, and then you grab your column)
In [20]: df1[ df2['Y'] > 50 ]['X']
Out[20]:
0 0
1 1
4 4
Name: X, dtype: int64
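As a small follow-up, the same selection can also be written with .loc, which avoids the chained indexing:
df1.loc[df2['Y'] > 50, 'X']   # boolean mask selects the rows, 'X' selects the column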