I have the following data frame df (NaN marks cells that are empty in the source data):
df =
ID_DATA  FD_1  FD_2  FD_3  FD_4  GRADE
111      23    12    34    45    1
111      23    67    45    NaN   5
111      12    67    45    23    5
222      23    55    66    NaN   4
222      55    66    NaN   NaN   4
I calculated the frequency per ID_DATA as follows:
freq = df.ID_DATA.value_counts().reset_index()
freq =
ID_DATA  FREQ
111      3
222      2
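(Note: depending on the pandas version, reset_index() on the value_counts() Series may label the columns index and ID_DATA rather than as shown above; one way to get exactly these names, an assumption about the setup rather than part of the original question, is:

freq = df.ID_DATA.value_counts().rename_axis('ID_DATA').reset_index(name='FREQ')
)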
However, I need to change the logic of this calculation as follows. There are two lists with different values of FD_*:
BaseList = [23,34]
AdjList = [12,45,67]
I need to count the frequency of the occurrence of values from these two lists in df. But there are some rules:
1) If a row contains any FD_* value from AdjList, then BaseList must not be counted for that row. BaseList is only counted when the row contains no value from AdjList.
2) If a row contains multiple values from BaseList, it should still be counted as +1.
3) If a row contains multiple values from AdjList, only the one in the last (rightmost) FD_* column should be counted.
The result should be this one:
ID_DATA  FREQ_BaseList  FREQ_12  FREQ_45  FREQ_67
111      0              0        3        0
222      1              0        0        0
The value of FREQ_BaseList is 0 for 111 because rule #1 fires.
The idea is to create a custom function for this and then adjust it as needed.
Here is one way to do it. You can of course make it a bit prettier by replacing the hardcoded list of columns:
>>> def worker1(x):
...     b = 0
...     for v in x:                        # x holds FD_4, FD_3, FD_2, FD_1, in that order
...         if v in AdjList:               # rules #1 and #3: the first AdjList hit wins
...             return ['FREQ_' + str(int(v)), 1]
...         else:
...             b = b + BaseList.count(v)  # otherwise accumulate BaseList matches
...     return ['FREQ_BaseList', b]
...
>>> def worker2(x):
...     r = worker1(x[['FD_4', 'FD_3', 'FD_2', 'FD_1']])  # scan columns right to left
...     return pd.Series([x['ID_DATA'], r[1]], index=['ID_DATA', r[0]])
...
>>> res = df.apply(worker2, axis=1).groupby('ID_DATA').sum()
>>> res
         FREQ_45  FREQ_BaseList
ID_DATA
111.0        3.0            NaN
222.0        NaN            1.0
>>> res.reindex(columns=['FREQ_BaseList','FREQ_12','FREQ_45','FREQ_67']).fillna(0).astype(int)
         FREQ_BaseList  FREQ_12  FREQ_45  FREQ_67
ID_DATA
111.0                0        0        3        0
222.0                1        0        0        0
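Note that ID_DATA comes back as a float index (111.0) because the intermediate frame built by apply contains NaNs, which upcasts the values to float. If you want the integer IDs back, you can cast the index:
>>> res.index = res.index.astype(int)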
My for loop does not read all the values in range(2004, 2012), which is very odd. When I try a simple function in the loop body, such as returning a, I can see that it reads all the values in the range. However, with pd.read_json it just does not work the same. I converted the data into a DataFrame, but only one year shows up in it. Am I missing something in my for loop?
import pandas as pd

test = range(2004, 2012)
testlist = list(test)
for i in testlist:
    a = f"https://api.census.gov/data/{i}/cps/basic/jun?get=GTCBSA,PEMNTVTY&for=state:*"
    b = pd.read_json(a)
    c = pd.DataFrame(b.iloc[1:,]).set_axis(b.iloc[0,], axis="columns", inplace=False)
    c['year'] = i
You're currently overwriting c in each pass of the loop. Instead, you need to concat the new data to the end of it:
import pandas as pd

test = range(2004, 2012)
testlist = list(test)
c = pd.DataFrame()
for i in testlist:
    a = f"https://api.census.gov/data/{i}/cps/basic/jun?get=GTCBSA,PEMNTVTY&for=state:*"
    b = pd.read_json(a)
    b = pd.DataFrame(b.iloc[1:,]).set_axis(b.iloc[0,], axis="columns", inplace=False)
    b['year'] = i
    c = pd.concat([c, b])  # append this year's rows instead of overwriting c
Output:
0 GTCBSA PEMNTVTY state year
1 0 316 2 2004
2 0 57 2 2004
3 0 57 2 2004
4 0 57 2 2004
5 22900 57 5 2004
... ... ... ... ...
133679 0 120 56 2011
133680 0 57 56 2011
133681 0 57 56 2011
133682 0 57 56 2011
133683 0 57 56 2011
[1087063 rows x 4 columns]
Note that you don't need to convert a range to a list in order to iterate over it. You can simply do
for i in range(2004, 2012):
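As a side note, calling pd.concat inside the loop copies the accumulated frame on every iteration. A common pattern, sketched here under the same assumptions as the code above, is to collect the yearly frames in a list and concatenate once at the end:

import pandas as pd

frames = []
for i in range(2004, 2012):
    a = f"https://api.census.gov/data/{i}/cps/basic/jun?get=GTCBSA,PEMNTVTY&for=state:*"
    b = pd.read_json(a)
    b = pd.DataFrame(b.iloc[1:,]).set_axis(b.iloc[0,], axis="columns", inplace=False)
    b['year'] = i
    frames.append(b)   # defer concatenation until after the loop

c = pd.concat(frames)  # a single concat instead of one per year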
I have a Pandas data frame like below.
     X    Y  Z
0   10  101  1
1   12  120  2
2   15  112  3
3    6  115  4
4    7  125  1
5   17  131  2
6   14  121  1
7   11  127  2
8   13  107  3
9    2  180  4
10  19  114  1
I want to calculate the average of the values in column X according to the group values in Z.
That is, something like
X               Z
(10+7+14+19)/4  1
(12+17+11)/3    2
(15+13)/2       3
(6+2)/2         4
What is an optimum way of doing this using Pandas?
It currently works this way, with plain Python:
from statistics import mean

sample_data = [['X','Y','Z'], [10,101,1], [12,120,2], [15,112,3], [6,115,4], [7,125,1], [17,131,2]]

def group_X_based_on_Z(data):
    value_pair = [(row[2], row[0]) for row in data[1:]]
    dictionary_with_grouped_values = {}
    for z, x in value_pair:
        dictionary_with_grouped_values.setdefault(z, []).append(x)
    return dictionary_with_grouped_values

def cal_avg_values(data):
    grouped_dictionary = group_X_based_on_Z(data)
    avg_value_dictionary = {}
    for z, x in grouped_dictionary.items():
        avg_value_dictionary[z] = mean(x)
    return avg_value_dictionary

print(cal_avg_values(sample_data))
Is there a Pandas-specific method for this?
Use the groupby function.
df.groupby('Z').agg(x_avg=('X', 'mean'))
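If you prefer Z as a regular column rather than the index, you can chain reset_index():

df.groupby('Z').agg(x_avg=('X', 'mean')).reset_index()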
Try:
s = df.groupby('Z', as_index=False).X.mean()
   Z          X
0  1  12.500000
1  2  13.333333
2  3  14.000000
3  4   4.000000
Here is the dataframe:
df = pd.DataFrame({'A': ['14:45:18', '14:45:02', '14:30:04', '14:30:00', '14:29:54', '14:29:34'],
'B': [891.1, 891.1, 891.8, 891.1, 891.1, 891.2],
'C': [3317, 1, 10, 2, 32, 33]})
output:
A B C
0 14:45:18 891.1 3317 # <-- these two rows should be combined
1 14:45:02 891.1 1 # <-- these two rows should be combined
2 14:30:04 891.8 10
3 14:30:00 891.1 2 # <-- also these two rows should be combined
4 14:29:54 891.1 32 # <-- also these two rows should be combined
5 14:29:34 891.2 33
How can I make it like this:
A B C
0 14:45:18 891.1 3318
1 14:30:04 891.8 10
2 14:30:00 891.1 34
3 14:29:34 891.2 33
In summary, how can I compare consecutive rows' B values and
if they have the same B value, sum their C values
while keeping the first row's A value?
Use:
# label consecutive runs of equal B values; the label increments whenever B changes
groups = df.B.ne(df.B.shift()).cumsum()
aggregated_df = df.groupby(groups, as_index=False).agg({'A': 'first', 'B': 'first', 'C': 'sum'})
print(aggregated_df)
print(aggregated_df)
A B C
0 14:45:18 891.1 3318
1 14:30:04 891.8 10
2 14:30:00 891.1 34
3 14:29:34 891.2 33
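To see why this works, here is what the run labels look like for the sample data; each change in B starts a new label, so only consecutive duplicates end up in the same group:

groups = df.B.ne(df.B.shift()).cumsum()
# groups is: 1, 1, 2, 3, 3, 4
# rows 0-1 form one group, row 2 its own, rows 3-4 one group, row 5 its own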
Let's say that I have a signal of 100 samples, L = 100.
In this signal I found some intervals that I label as "OK". The intervals are stored in a Pandas DataFrame that looks like this:
c = pd.DataFrame(np.array([[10,26],[50,84]]),columns=['Start','End'])
c['Value']='OK'
How can I add the complementary intervals in another DataFrame, in order to have something like this:
d = pd.DataFrame(np.array([[0,9],[10,26],[27,49],[50,84],[85,100]]),columns=['Start','End'])
d['Value']=['Check','OK','Check','OK','Check']
You can use the first DataFrame to create the second one and merge, as suggested by @jezrael:
d = pd.DataFrame({"Start": [0] + sorted(pd.concat([c.Start, c.End + 1])),
                  "End": sorted(pd.concat([c.Start - 1, c.End])) + [100]})
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
d = d.reindex(columns=["Start", "End", "Value"])
output
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
I think you need (assuming d already contains all the Start/End rows, as in the question's expected output):
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
EDIT:
You can use numpy.concatenate with numpy.sort and numpy.column_stack, then the DataFrame constructor for the new df. Finally, merge with c and fillna with a dict to fill the missing entries in the Value column:
s = np.sort(np.concatenate([[0], c['Start'].values, c['End'].values + 1]))
e = np.sort(np.concatenate([c['Start'].values - 1, c['End'].values, [100]]))
d = pd.DataFrame(np.column_stack([s,e]), columns=['Start','End'])
d = pd.merge(d, c, how='left').fillna({'Value':'Check'})
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
EDIT1:
New boundary values are added to c by loc, then the Start/End columns are reshaped to a Series by stack and shifted. Finally, the DataFrame is rebuilt by unstack:
b = c.copy()
max_val = 100
min_val = 0
c.loc[-1, 'Start'] = max_val + 1
a = c[['Start','End']].stack(dropna=False).shift().fillna(min_val - 1).astype(int).unstack()
a['Start'] = a['Start'] + 1
a['End'] = a['End'] - 1
a['Value'] = 'Check'
print (a)
    Start  End  Value
 0      0    9  Check
 1     27   49  Check
-1     85  100  Check
d = pd.concat([b, a]).sort_values('Start').reset_index(drop=True)
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
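For comparison, the complement can also be built with a plain loop; this is a sketch of my own, assuming the OK intervals are sorted and non-overlapping, with L = 100 as the signal length from the question:

import numpy as np
import pandas as pd

L = 100
c = pd.DataFrame(np.array([[10, 26], [50, 84]]), columns=['Start', 'End'])
c['Value'] = 'OK'

rows, prev_end = [], -1
for start, end in zip(c['Start'], c['End']):
    if start > prev_end + 1:                  # gap before this OK interval
        rows.append((prev_end + 1, start - 1, 'Check'))
    rows.append((start, end, 'OK'))           # the OK interval itself
    prev_end = end
if prev_end < L:                              # trailing gap after the last interval
    rows.append((prev_end + 1, L, 'Check'))

d = pd.DataFrame(rows, columns=['Start', 'End', 'Value'])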
I need a count of non-zero variables in pairs of rows.
I have a dataframe that lists density of species found at several sampling points. I need to know the total number of species found at each pair of sampling points. Here is an example of my data:
>>> import pandas as pd
>>> df = pd.DataFrame({'ID': [111,222,333,444], 'minnow': [1,3,5,4], 'trout': [2,0,0,3], 'bass': [0,1,3,0], 'gar': [0,1,0,0]})
>>> df
ID bass gar minnow trout
0 111 0 0 1 2
1 222 1 1 3 0
2 333 3 0 5 0
3 444 0 0 4 3
I will pair the rows by ID number, so the pair (111,222) should return a total of 4, while the pair (111,333) should return a total of 3. I know I can get a sum of non-zeros for each row, but if I add those totals for each pair I will be double counting some of the species.
Here's an approach with NumPy -
In [35]: df
Out[35]:
ID bass gar minnow trout
0 111 0 0 1 2
1 222 1 1 3 0
2 333 3 0 5 0
3 444 0 0 4 3
In [36]: a = df.iloc[:,1:].values != 0         # boolean presence matrix (site x species)
In [37]: r,c = np.triu_indices(df.shape[0],1)  # indices of every unordered row pair
In [38]: l = df.ID
In [39]: pd.DataFrame(np.column_stack((l[r], l[c], (a[r] | a[c]).sum(1))))
Out[39]:
0 1 2
0 111 222 4
1 111 333 3
2 111 444 2
3 222 333 3
4 222 444 4
5 333 444 3
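For readability you could also label the result's columns; the names below are illustrative choices of mine, not part of the original answer:

In [40]: pd.DataFrame(np.column_stack((l[r], l[c], (a[r] | a[c]).sum(1))),
    ...:              columns=['ID_1', 'ID_2', 'n_species'])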
You can do this using iloc for slicing and NumPy:
np.sum((df.iloc[[0, 1], 1:] != 0).any(axis=0))
Here df.iloc[[0, 1], 1:] selects the first two rows (without the ID column), and the sum counts the species that are non-zero in at least one of the two rows. You can pass any combination of row positions to df.iloc; a sketch that loops over all pairs follows below.
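Building on that snippet, you could loop over all row pairs with itertools.combinations (my addition, not part of the original answer) to reproduce the full pairwise table:

from itertools import combinations

import numpy as np

for i, j in combinations(range(len(df)), 2):
    n = np.sum((df.iloc[[i, j], 1:] != 0).any(axis=0))
    print(df.ID.iloc[i], df.ID.iloc[j], n)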
If the rows are sorted so that the two groups occur one after another, you could do
import pandas as pd
import numpy as np
x = np.random.randint(0,2,(10,3))
df = pd.DataFrame(x)
pair_a = df.loc[::2].reset_index(drop=True)   # even rows: first member of each pair
pair_b = df.loc[1::2].reset_index(drop=True)  # odd rows: second member of each pair
paired = pd.concat([pair_a, pair_b], axis=1)
Then find where paired is non-zero.
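For example, the per-pair species count could then be computed like this (my completion of the idea, not part of the original answer):

n_species = ((pair_a != 0) | (pair_b != 0)).sum(axis=1)  # species present in either member of each pair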