Python: How to count specific values in specific columns in a dataframe

I have a CSV file, for example:
col1 col2 col3 col4
a 1 2 3
b 1 2 1
c 1 1 3
d 3 1 2
I want to count the number of occurrences of a particular value, e.g. 1, in col2, col3 and col4.
I am using the following code with pandas:
import pandas as pd
fname = input('Enter the filename:')
df = pd.read_csv(fname, header='infer')
one = df.iloc[:,1:4].value_counts(normalize=False).loc[1]
This raises an error, but when I do the same thing for a single named column the code runs fine:
import pandas as pd
fname = input('Enter the filename:')
df = pd.read_csv(fname, header='infer')
one = df['col1'].value_counts(normalize=False).loc[1]
I want the following output
col2 3
col3 2
col4 1
Any help or tips would be greatly appreciated! Thank you in advance. :)

Use eq with the desired value (here 1) and then sum:
df[['col2', 'col3', 'col4']].eq(1).sum()
col2 3
col3 2
col4 1
dtype: int64
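If you prefer to keep the positional selection from the original attempt, the same idea works with iloc. A minimal sketch, assuming the column layout shown in the question (the target columns sit at positions 1-3):
# Count how many cells equal 1 in col2, col3 and col4, selected by position.
one = df.iloc[:, 1:4].eq(1).sum()
print(one)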

I reached this question while searching for a way to check how many values are higher or lower than zero in the columns 'Buys' and 'Sells' of the following DataFrame (named trade_track):
Ticker Pre-trade Buys Sells Net Exposure Ch. Post-trade
CX 10126.0 0.0 -964.0 -0.095200 9162.0
OI 3311.0 0.0 -24.0 -0.007249 3287.0
THO 748.0 0.0 -33.0 -0.044118 715.0
WRK 1002.0 0.0 -43.0 -0.042914 959.0
TAP 646.0 0.0 -4.0 -0.006192 642.0
TRN 1987.0 0.0 -93.0 -0.046804 1894.0
SJM 312.0 6.0 0.0 0.019231 318.0
WW 1100.0 0.0 -22.0 -0.020000 1078.0
FAST -655.0 13.0 0.0 -0.019847 -642.0
CSX -301.0 6.0 0.0 -0.019934 -295.0
ODFL -123.0 0.0 0.0 -0.000000 -123.0
HELE -130.0 0.0 0.0 -0.000000 -130.0
SBUX -203.0 0.0 0.0 -0.000000 -203.0
WM -166.0 0.0 0.0 -0.000000 -166.0
HD -90.0 2.0 0.0 -0.022222 -88.0
VMC -141.0 0.0 0.0 -0.000000 -141.0
CTAS -76.0 2.0 0.0 -0.026316 -74.0
ORLY -53.0 0.0 0.0 -0.000000 -53.0
Here is simple code that worked:
(i) to count the values above zero in column 'Buys':
(trade_track['Buys'] > 0).sum()
(ii) to count the zeros in column 'Buys':
(trade_track['Buys'] == 0).sum()
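Both checks can also be run on several columns at once. A small sketch, assuming the same trade_track DataFrame:
# Per-column counts of positive values and of exact zeros in 'Buys' and 'Sells'.
above_zero = (trade_track[['Buys', 'Sells']] > 0).sum()
zeros = trade_track[['Buys', 'Sells']].eq(0).sum()
print(above_zero)
print(zeros)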

Related

Correlation between columns of different dataframes

I have many dataframes. They all share the same column structure "date", "open_position_profit", "more columns...".
date open_position_profit col2 col3
0 2008-04-01 -260.0 1 290.0
1 2008-04-02 -340.0 1 -60.0
2 2008-04-03 100.0 1 40.0
3 2008-04-04 180.0 1 -90.0
4 2008-04-05 0.0 0 0.0
Although "date" is present in all dataframes, they might or might not have the same count (some dates might be in one dataframe but not the other).
I want to compute a correlation matrix of the columns "open_position_profit" of all these dataframes.
I've tried this
dfs = [df1[["date", "open_position_profit"]], df2[["date", "open_position_profit"]], ...]
pd.concat(dfs).groupby('date', as_index=False).corr()
But this gives me a correlation for each group rather than a single matrix:
open_position_profit
0 open_position_profit 1.0
1 open_position_profit 1.0
2 open_position_profit 1.0
3 open_position_profit 1.0
4 open_position_profit NaN
I want the correlation over the entire time series, not per group. How can I do this?
If I understand your intention correctly, you need to do an outer join first. The following code joins the two dataframes on the date key; missing values are represented by NaN.
df = pd.merge(df1, df2, on='date', how='outer')
date open_position_profit_x open_position_profit_y ... ...
0 2019-01-01 ...
1 2019-01-02 ...
2 2019-01-03 ...
3 2019-01-04 ...
Then you can calculate the correlation with the new DataFrame.
df.corr()
open_position_profit_x open_position_profit_y ... ...
open_position_profit_x 1.000000 0.866025
open_position_profit_y 0.866025 1.000000
... 1.000000 1.000000
... 1.000000 1.000000
See: pd.merge
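For more than two dataframes, the outer joins can be chained. A sketch assuming the dfs list built above; the profit_0, profit_1, ... column names are only illustrative:
from functools import reduce
import pandas as pd

# Give each frame's profit column a unique name so the merge keeps them apart.
renamed = [
    d.rename(columns={'open_position_profit': f'profit_{i}'})
    for i, d in enumerate(dfs)
]

# Outer-merge every frame on 'date', then compute one correlation matrix
# over the full (aligned) time series.
merged = reduce(lambda left, right: pd.merge(left, right, on='date', how='outer'), renamed)
corr_matrix = merged.drop(columns='date').corr()
print(corr_matrix)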

python pandas dataframe how to apply a function to each time period

I have the following dataframe,
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': range(9), 'col2': list(range(7)) + [np.nan] * 2},
                  index=pd.date_range('1/1/2000', periods=9, freq='0.5S'))
df
Out[109]:
col1 col2
2000-01-01 00:00:00.000 0 0.0
2000-01-01 00:00:00.500 1 1.0
2000-01-01 00:00:01.000 2 2.0
2000-01-01 00:00:01.500 3 3.0
2000-01-01 00:00:02.000 4 4.0
2000-01-01 00:00:02.500 5 5.0
2000-01-01 00:00:03.000 6 6.0
2000-01-01 00:00:03.500 7 NaN
2000-01-01 00:00:04.000 8 NaN
As can be seen above, there are two data points per second. For the two rows within a second: if both columns in the later row hold valid numbers, choose that row; if either column in the later row is invalid, check whether the earlier row is valid in both columns and, if so, choose it; otherwise skip that second. The resulting dataframe looks like this:
col1 col2
2000-01-01 00:00:00.000 1 1.0
2000-01-01 00:00:01.000 3 3.0
2000-01-01 00:00:02.000 5 5.0
2000-01-01 00:00:03.000 6 6.0
How to achieve this?
Here is one way using reindex: after dropna we reindex, so the rows that were dropped come back with NaN in both columns. In that situation last will not pick anything from those rows (related to your previous question).
df.dropna().reindex(df.index).resample('1s').last().dropna()
Out[175]:
col1 col2
2000-01-01 00:00:00 1.0 1.0
2000-01-01 00:00:01 3.0 3.0
2000-01-01 00:00:02 5.0 5.0
2000-01-01 00:00:03 6.0 6.0
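An alternative sketch, assuming the same df as in the question: group on the timestamp floored to the whole second and keep the last fully valid row in each group; seconds with no valid row simply produce no group.
# Keep only rows with no NaN, then take the last such row per second.
valid = df.dropna()
result = valid.groupby(valid.index.floor('1s')).last()
print(result)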

astype() does not change floats

Even though this seems really simple, it drives me nuts. Why is .astype(int) not changing the floats to ints? Thank you
df_new = pd.crosstab(df["date"], df["place"]).reset_index()
places = ['cityA', "cityB", "cityC"]
df_new[places] = df_new[places].fillna(0).astype(int)
sums = df_new.select_dtypes(pd.np.number).sum().rename('total')
df_new = df_new.append(sums)
print(df_new)
Output:
place date cityA cityB cityC
0 2008-01-01 0.0 0.0 51.0
1 2009-06-01 0.0 618.0 0.0
2 2015-07-01 549.0 0.0 0.0
3 2016-01-01 41.0 0.0 0.0
4 2016-04-01 62.0 0.0 0.0
5 2017-01-01 800.0 0.0 0.0
6 2018-07-01 69.0 0.0 0.0
total NaT 1521.0 618.0 51.0
If there are NAs involved (and pandas stores NA as a float), the other values will be upcast to float as well.
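One workaround, sketched with the same df_new and places names as above: build the totals row separately and cast the place columns back to int after concatenating, so nothing upcasts them afterwards. (This replaces the append step in the question; DataFrame.append is deprecated in newer pandas, so pd.concat is used.)
# Build the totals row as a one-row DataFrame and restore the int dtype last.
totals = df_new[places].sum().rename('total')
df_new = pd.concat([df_new, totals.to_frame().T])
df_new[places] = df_new[places].astype(int)  # cast after the row is added
print(df_new.dtypes)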

Pandas DataFrame --> GroupBy --> MultiIndex Process

I'm trying to restructure a large DataFrame of the following form as a MultiIndex:
date store_nbr item_nbr units snowfall preciptotal event
0 2012-01-01 1 1 0 0.0 0.0 0.0
1 2012-01-01 1 2 0 0.0 0.0 0.0
2 2012-01-01 1 3 0 0.0 0.0 0.0
3 2012-01-01 1 4 0 0.0 0.0 0.0
4 2012-01-01 1 5 0 0.0 0.0 0.0
I want to group by store_nbr (1-45), then within each store_nbr group by item_nbr (1-111), and for each index pair (e.g., store_nbr=12, item_nbr=109) display the rows in chronological order, for example:
store_nbr=12, item_nbr=109: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=0, snowfall=...
date=2014-02-08, units=0, snowfall=...
... ...
store_nbr=12, item_nbr=110: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=1, snowfall=...
date=2014-02-08, units=1, snowfall=...
...
It looks like some combination of groupby and set_index might be useful here, but I'm getting stuck after the following line:
grouped = stores.set_index(['store_nbr', 'item_nbr'])
This produces the following MultiIndex:
date units snowfall preciptotal event
store_nbr item_nbr
1 1 2012-01-01 0 0.0 0.0 0.0
2 2012-01-01 0 0.0 0.0 0.0
3 2012-01-01 0 0.0 0.0 0.0
4 2012-01-01 0 0.0 0.0 0.0
5 2012-01-01 0 0.0 0.0 0.0
Does anyone have any suggestions from here? Is there an easy way to do this by manipulating groupby objects?
You can sort your rows with:
df.sort_values(by='date')
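A fuller sketch, assuming the stores DataFrame from the question: sort by store, item and date before building the MultiIndex, so every (store_nbr, item_nbr) pair lists its rows chronologically.
# Sort first, then index; .loc on an index pair shows one ordered group.
grouped = (
    stores.sort_values(['store_nbr', 'item_nbr', 'date'])
          .set_index(['store_nbr', 'item_nbr'])
)
print(grouped.loc[(12, 109)])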

apply values to pandas column

Given a dataframe such as:
dasz_id sector counts
0 0 dasz_id 2011.0
1 NaN wah11 0.0
2 NaN wah21 0.0
3 0 dasz_id 2012.0
4 NaN wah11 0.0
5 NaN wah21 0.0
I'm trying to take the dasz_id value and apply it to all the following rows until a new dasz_id value appears, so the desired output would look like:
dasz_id sector counts
0 2011 dasz_id 2011.0
1 2011 wah11 0.0
2 2011 wah21 0.0
3 2012 dasz_id 2012.0
4 2012 wah11 0.0
5 2012 wah21 0.0
I've created a function using the apply method, which copies the value over, but I don't know how to apply the value to the rest of the rows.
What am I doing wrong?
def dasz(row):
    if row.sector == "dasz_id":
        return int(row.counts)
    else:
        # get previous dasz_id value

e["dasz_id"] = e.apply(dasz, axis=1)
I do not know why you have a duplicated index, but here is one way:
df['dasz_id'] = df['counts']
df['dasz_id'] = df['dasz_id'].replace({0:np.nan}).ffill()
df
Out[84]:
dasz_id sector counts
0 2011.0 dasz_id 2011.0
1 2011.0 wah11 0.0
2 2011.0 wah21 0.0
0 2012.0 dasz_id 2012.0
1 2012.0 wah11 0.0
2 2012.0 wah21 0.0
Using the dasz function you created and the ffill approach from Wen's answer, you could also do:
def dasz(row):
    # return counts only on the 'dasz_id' rows; every other row becomes NaN
    if row.sector == "dasz_id":
        return row.counts

e["dasz_id"] = e.apply(dasz, axis=1)
e.ffill(inplace=True)  # forward-fill the gaps with the last seen dasz_id
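Another compact option, sketched against the same DataFrame e: keep counts only on the 'dasz_id' rows, forward-fill, and cast to int.
# where() blanks out every row whose sector is not 'dasz_id', ffill copies
# the last dasz_id downwards, astype(int) drops the trailing .0.
e["dasz_id"] = e["counts"].where(e["sector"] == "dasz_id").ffill().astype(int)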
