Let's say I have a signal of 100 samples (L = 100).
In this signal I found some intervals that I label as "OK". The intervals are stored in a Pandas DataFrame that looks like this:
c = pd.DataFrame(np.array([[10,26],[50,84]]),columns=['Start','End'])
c['Value']='OK'
How can I add the complementary intervals in another DataFrame so that I end up with something like this:
d = pd.DataFrame(np.array([[0,9],[10,26],[27,49],[50,84],[85,100]]),columns=['Start','End'])
d['Value']=['Check','OK','Check','OK','Check']
You can use the first DataFrame to create the second one and merge, as suggested by @jezrael:
starts = [0] + sorted(pd.concat([c.Start, c.End + 1]))
ends = sorted(pd.concat([c.Start - 1, c.End])) + [100]
d = pd.DataFrame({"Start": starts, "End": ends})
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
d = d.reindex(columns=["Start","End","Value"])
Output:
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
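For reference, a self-contained sketch of the whole approach (assuming, as in the question, that the signal spans samples 0 through 100):

```python
import numpy as np
import pandas as pd

# "OK" intervals from the question
c = pd.DataFrame(np.array([[10, 26], [50, 84]]), columns=['Start', 'End'])
c['Value'] = 'OK'

L = 100  # last sample index

# Complementary boundaries: gaps start at 0 and one past each interval end,
# and end one before each interval start and at L
starts = sorted([0] + list(c['Start']) + list(c['End'] + 1))
ends = sorted(list(c['Start'] - 1) + list(c['End']) + [L])

d = pd.DataFrame({'Start': starts, 'End': ends})
d = d.merge(c, how='left')            # pulls in 'OK' for matching intervals
d['Value'] = d['Value'].fillna('Check')
print(d)
```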
I think you need:
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
EDIT:
You can use numpy.concatenate with numpy.sort and numpy.column_stack, then the DataFrame constructor for the new df. Last, merge and fill the missing values in the Value column with fillna, passing a dict:
s = np.sort(np.concatenate([[0], c['Start'].values, c['End'].values + 1]))
e = np.sort(np.concatenate([c['Start'].values - 1, c['End'].values, [100]]))
d = pd.DataFrame(np.column_stack([s,e]), columns=['Start','End'])
d = pd.merge(d, c, how='left').fillna({'Value':'Check'})
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
EDIT1:
New boundary values are added by loc, then the frame is reshaped to a Series by stack and shifted. Last, the df is created back by unstack:
b = c.copy()
max_val = 100
min_val = 0
c.loc[-1, 'Start'] = max_val + 1
a = c[['Start','End']].stack(dropna=False).shift().fillna(min_val - 1).astype(int).unstack()
a['Start'] = a['Start'] + 1
a['End'] = a['End'] - 1
a['Value'] = 'Check'
print (a)
Start End Value
0 0 9 Check
1 27 49 Check
-1 85 100 Check
d = pd.concat([b, a]).sort_values('Start').reset_index(drop=True)
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
My for loop does not read all the values in the range (2004, 2012), which is super odd. When I try a simple statement in my for loop such as return a, I do see that it reads all the values in the range. However, when I use pd.read_json, it just does not work the same. I converted the data into a dataframe, but only one year shows up in my dataframe. Am I missing something in my for loop?
test = range(2004, 2012)
testlist = list(test)
for i in testlist:
    a = f"https://api.census.gov/data/{i}/cps/basic/jun?get=GTCBSA,PEMNTVTY&for=state:*"
    b = pd.read_json(a)
    c = pd.DataFrame(b.iloc[1:,]).set_axis(b.iloc[0,], axis="columns", inplace=False)
    c['year'] = i
You're currently overwriting c in each pass of the loop. Instead, you need to concat the new data to the end of it:
test = range(2004, 2012)
testlist = list(test)
c = pd.DataFrame()
for i in testlist:
    a = f"https://api.census.gov/data/{i}/cps/basic/jun?get=GTCBSA,PEMNTVTY&for=state:*"
    b = pd.read_json(a)
    b = pd.DataFrame(b.iloc[1:,]).set_axis(b.iloc[0,], axis="columns", inplace=False)
    b['year'] = i
    c = pd.concat([c, b])
Output:
0 GTCBSA PEMNTVTY state year
1 0 316 2 2004
2 0 57 2 2004
3 0 57 2 2004
4 0 57 2 2004
5 22900 57 5 2004
... ... ... ... ...
133679 0 120 56 2011
133680 0 57 56 2011
133681 0 57 56 2011
133682 0 57 56 2011
133683 0 57 56 2011
[1087063 rows x 4 columns]
Note you don't need to convert a range to a list to iterate it. You can simply do
for i in range(2004, 2012):
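One more side note: pd.concat inside the loop re-copies all accumulated rows on every pass. Collecting the yearly frames in a list and concatenating once at the end scales better. A sketch with a stand-in for the API call (the real code would keep the pd.read_json line):

```python
import pandas as pd

frames = []
for i in range(2004, 2012):
    # stand-in for: b = pd.read_json(f"https://api.census.gov/data/{i}/...")
    b = pd.DataFrame({'GTCBSA': [0, 22900], 'PEMNTVTY': [316, 57]})
    b['year'] = i
    frames.append(b)

c = pd.concat(frames, ignore_index=True)  # one concatenation instead of eight
print(c.shape)
```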
I have a dataframe with 9 columns, two of which are gender and smoker status. Every row in the dataframe is a person, and each column is their entry on a particular trait.
I want to count the number of entries that satisfy the condition of being both a smoker and is male.
I have tried using a sum function:
maleSmoke = sum(1 for i in data['gender'] if i is 'm' and i in data['smoker'] if i is 1 )
but this always returns 0. This method works when I only check one criterion, however, and I can't figure out how to expand it to a second.
I also tried writing a function that counted its way through every entry into the dataframe but this also returns 0 for all entries.
def countSmokeGender(df):
    maleSmoke = 0
    femaleSmoke = 0
    maleNoSmoke = 0
    femaleNoSmoke = 0
    for i in range(20000):
        if df['gender'][i] is 'm' and df['smoker'][i] is 1:
            maleSmoke = maleSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 1:
            femaleSmoke = femaleSmoke + 1
        if df['gender'][i] is 'm' and df['smoker'][i] is 0:
            maleNoSmoke = maleNoSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 0:
            femaleNoSmoke = femaleNoSmoke + 1
    return maleSmoke, femaleSmoke, maleNoSmoke, femaleNoSmoke
I've tried pulling out the data sets as numpy arrays and counting those but that wasn't working either.
Are you using pandas?
Assuming you are, you can simply do this. (As a side note, `is` tests object identity, not equality, so comparisons like `i is 'm'` or `df['smoker'][i] is 1` are unreliable and are why your attempts return 0; use `==` instead.)
# How many male smokers
len(df[(df['gender']=='m') & (df['smoker']==1)])
# How many female smokers
len(df[(df['gender']=='f') & (df['smoker']==1)])
# How many male non-smokers
len(df[(df['gender']=='m') & (df['smoker']==0)])
# How many female non-smokers
len(df[(df['gender']=='f') & (df['smoker']==0)])
Or, you can use groupby:
df.groupby(['gender'])['smoker'].sum()
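Note the groupby/sum above only returns the number of smokers per gender. If you want all four counts at once, pd.crosstab (or groupby(...).size()) gives the full breakdown; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['m', 'f', 'm', 'm', 'f', 'f'],
                   'smoker': [1, 1, 0, 1, 0, 0]})

# Rows: smoker status, columns: gender, cells: counts
counts = pd.crosstab(df['smoker'], df['gender'])
print(counts)

# Equivalent long form
sizes = df.groupby(['gender', 'smoker']).size()
print(sizes)
```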
Another alternative, which is great for data exploration: .pivot_table
With a DataFrame like this
id gender smoker other_trait
0 0 m 0 0
1 1 f 1 1
2 2 m 1 0
3 3 m 1 1
4 4 f 1 0
.. .. ... ... ...
95 95 f 0 0
96 96 f 1 1
97 97 f 0 1
98 98 m 0 0
99 99 f 1 0
you could do
result = df.pivot_table(
index="smoker", columns="gender", values="id", aggfunc="count"
)
to get a result like
gender f m
smoker
0 32 16
1 27 25
If you want to display the partial counts you can add the margins=True option and get
gender f m All
smoker
0 32 16 48
1 27 25 52
All 59 41 100
If you don't have a column to count over (you can't use smoker and gender because they are used for the labels) you could add a dummy column:
result = df.assign(dummy=1).pivot_table(
index="smoker", columns="gender", values="dummy", aggfunc="count",
margins=True
)
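As an alternative to the dummy column, pivot_table also accepts aggfunc="size", which counts rows without needing a values column; a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['m', 'f', 'm', 'm', 'f'],
                   'smoker': [0, 1, 1, 1, 1]})

# aggfunc="size" counts rows per cell; fill_value=0 covers empty combinations
result = df.pivot_table(index='smoker', columns='gender',
                        aggfunc='size', fill_value=0)
print(result)
```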
I'm writing code to merge several dataframes together using pandas.
Here is my first table :
Index Values Intensity
1 11 98
2 12 855
3 13 500
4 24 140
and here is the second one:
Index Values Intensity
1 21 1000
2 11 2000
3 24 0.55
4 25 500
With these two dfs, I concatenate and drop_duplicates on the Values column, which gives me the following df:
Index Values Intensity_df1 Intensity_df2
1 11 0 0
2 12 0 0
3 13 0 0
4 24 0 0
5 21 0 0
6 25 0 0
I would like to recover the intensity of each value in each DataFrame. For this purpose I'm iterating through each line of each df, which is very inefficient. Here is the code I use:
m = 0
while m < len(num_df):
    n = 0
    while n < len(df3):
        temp_intens_abs = df[m]['Intensity'][df3['Values'][n] == df[m]['Values']]
        if temp_intens_abs.empty:
            merged.at[n, "Intensity_df%s" % df[m]] = 0
        else:
            merged.at[n, "Intensity_df%s" % df[m]] = pandas.to_numeric(temp_intens_abs, errors='coerce')
        n = n + 1
    m = m + 1
The resulting df3 looks like this at the end:
Index Values Intensity_df1 Intensity_df2
1 11 98 2000
2 12 855 0
3 13 500 0
4 24 140 0.55
5 21 0 1000
6 25 0 500
My question is: is there a way to directly recover "present" values in a df by comparing two columns using pandas? I've tried several solutions using numpy but without success. Thanks in advance for your help.
You can try joining these dataframes: df3 = df1.merge(df2, on="Values", how="outer"). The outer join keeps values that appear in only one of the frames.
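A sketch reproducing the question's expected table; note that how="outer" is needed to keep values present in only one frame (column names assumed from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'Values': [11, 12, 13, 24],
                    'Intensity': [98, 855, 500, 140]})
df2 = pd.DataFrame({'Values': [21, 11, 24, 25],
                    'Intensity': [1000, 2000, 0.55, 500]})

# Outer join keeps unmatched values; suffixes label the source frames,
# and fillna(0) replaces the missing intensities with 0
df3 = df1.merge(df2, on='Values', how='outer',
                suffixes=('_df1', '_df2')).fillna(0)
print(df3)
```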
Let us assume I am taking a temperature measurement on a regular interval and recording the values in a Pandas Dataframe
day temperature [F]
0 89
1 91
2 93
3 88
4 90
Now I want to create another column which is set to 1 if and only if the two most recent values (the current one and the one before it) are above a certain level. In my scenario I want a column value of 1 if two consecutive values are above 90, thus yielding:
day temperature Above limit?
0 89 0
1 91 0
2 93 1
3 88 0
4 91 0
5 91 1
6 93 1
Despite some SO and Google digging, it's not clear whether I should use iloc[x], loc[x], or something else in a for loop.
You are looking for the shift function in pandas.
import io
import pandas as pd
data = """
day temperature Expected
0 89 0
1 91 0
2 93 1
3 88 0
4 91 0
5 91 1
6 93 1
"""
data = io.StringIO(data)
df = pd.read_csv(data, sep=r'\s+')
df['Result'] = ((df['temperature'].shift(1) > 90) & (df['temperature'] > 90)).astype(int)
# Validation
(df['Result'] == df['Expected']).all()
Try this:
df = pd.DataFrame({'temperature': [89, 91, 93, 88, 90, 91, 91, 93]})
limit = 90
df['Above'] = ((df['temperature']>limit) & (df['temperature'].shift(1)>limit)).astype(int)
df
In the future, please include the code needed for testing (in this case, the df construction line).
df['limit'] = ""
df.iloc[0, 2] = 0
for i in range(1, len(df)):
    if df.iloc[i, 1] > 90 and df.iloc[i-1, 1] > 90:
        df.iloc[i, 2] = 1
    else:
        df.iloc[i, 2] = 0
Here iloc[i, 2] refers to row index i and column index 2 (the limit column). Hope this helps.
Solution using shift():
>>> threshold = 90
>>> df['Above limit?'] = 0
>>> df.loc[((df['temperature [F]'] > threshold) & (df['temperature [F]'].shift(1) > threshold)), 'Above limit?'] = 1
>>> df
day temperature [F] Above limit?
0 0 89 0
1 1 91 0
2 2 93 1
3 3 88 0
4 4 90 0
Try using rolling(window=2) and then apply() as follows (raw=True passes each window as a NumPy array, so the positional x[0]/x[-1] indexing works reliably):
df["limit"] = df['temperature'].rolling(2).apply(lambda x: int(x[0] > 90) & int(x[-1] > 90), raw=True)
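A closely related variant that avoids indexing into the window at all: both of the last two readings exceed the limit exactly when the window minimum does. A sketch using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({'temperature': [89, 91, 93, 88, 90, 91, 91, 93]})
limit = 90

# min over a 2-wide window > limit  <=>  both values in the window > limit
# (the first row has an incomplete window, so its min is NaN and compares False)
df['Above'] = (df['temperature'].rolling(2).min() > limit).astype(int)
print(df['Above'].tolist())  # [0, 0, 1, 0, 0, 0, 1, 1]
```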
There is the following data frame df:
df =
ID_DATA FD_1 FD_2 FD_3 FD_4 GRADE
111 23 12 34 45 1
111 23 67 45 5
111 12 67 45 23 5
222 23 55 66 4
222 55 66 4
I calculated the frequency per ID_DATA as follows:
freq = df.ID_DATA.value_counts().reset_index()
freq =
ID_DATA FREQ
111 3
222 2
However, I need to change the logic of this calculation as follows. There are two lists with different values of FD_*:
BaseList = [23,34]
AdjList = [12,45,67]
I need to count the frequency of the occurrence of values from these two lists in df. But there are some rules:
1) If a row contains any value of FD_* that belongs to AdjList, then BaseList should not be counted. The counting of BaseList should only be done if a row does not contain any value from AdjList.
2) If a row contains multiple values of BaseList, then it should be counted as +1.
3) If a row contains multiple values of AdjList, then only the last column FD_* should be counted.
The result should be this one:
ID_DATA FREQ_BaseList FREQ_12 FREQ_45 FREQ_67
111 0 0 3 0
222 1 0 0 0
The value of FREQ_BaseList is equal to 0 for 111, because of firing the rule #1.
The idea is to create a custom function for this and then adjust it as needed. You can of course make it a bit prettier by replacing the hardcoded list of columns:
>>> def worker1(x):
... b = 0
... for v in x:
... if v in AdjList:
... return ['FREQ_' + str(int(v)), 1]
... else:
... b = b + BaseList.count(v)
... return ('FREQ_BaseList', b)
...
>>> def worker2(x):
... r = worker1(x[['FD_4','FD_3','FD_2','FD_1']])
... return pd.Series([x['ID_DATA'], r[1]], index=['ID_DATA', r[0]])
...
>>> res = df.apply(worker2, axis=1).groupby('ID_DATA').sum()
>>> res
FREQ_45 FREQ_BaseList
ID_DATA
111.0 3.0 NaN
222.0 NaN 1.0
>>> res.reindex(columns=['FREQ_BaseList','FREQ_12','FREQ_45','FREQ_67']).fillna(0).astype(int)
FREQ_BaseList FREQ_12 FREQ_45 FREQ_67
ID_DATA
111.0 0 0 3 0
222.0 1 0 0 0