Specific data from a dataframe based on date in another column - python

#app.callback(
Output('stats', 'children'),
Input('picker_main', 'date'))
def update_table(date_value):
table = {}
for query_id in queries_daily:
df_temp = data_manager.data[query_id]
df_temp.set_index('day')
try:
table[query_id] = df_temp[query_id].where(df_temp['day'] == datetime.strptime(date_value, "%Y-%m-%d").date())
#table[query_id] = df_temp.loc["day", query_id].where(df_temp['day'] == temp)
#table[query_id] = df_temp[df_temp["day"] == temp]
except Exception as e:
table[query_id] = 0
print(e)
I'm trying to get a row from the dateframe and store it in a dictionary or another dateframe.
It's actually only about this bit:
table[query_id] = df_temp[query_id].where(df_temp['day'] == datetime.strptime(date_value, "%Y-%m-%d").date())
table --> empty dict
df_temp --> df with 2 columns - first with name in variable "query_id" and second with date. From which I'm trying to get value stored in the column named with "query_id" along with the "query_id" keyword.
I've been trying also converting date to string format and using dataframe instead of an empty dictionary
It doesn't return any data. I posted a longer piece of code at first as was wondering if maybe someone spot that I can do something in a better way
Thanks!

Seems like you need query
table = { query_id : data_manager.data[query_id]\
.query(f'day == "{date_value}"')['day']
for query_id in queries_daily }

Related

Counting combinations in Dataframe create new Dataframe

So I have a dataframe called reactions_drugs
and I want to create a table called new_r_d where I keep track of how often a see a symptom for a given medication like
Here is the code I have but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0"
new_r_d = pd.DataFrame(columns = ['drugname', 'reaction', 'count']
for i in range(len(reactions_drugs)):
name = reactions_drugs.drugname[i]
drug_rec_act = reactions_drugs.drug_rec_act[i]
for rec in drug_rec_act:
row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
if row == []:
# create new row
new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
else:
new_r_d.at[row,'count'] += 1
Assuming the rows in your current reactions (drug_rec_act) column contain one string enclosed in a list, you can convert the values in that column to lists of strings (by splitting each string on the comma delimiter) and then utilize the explode() function and value_counts() to get your desired result:
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')

How to return a non empty data frame among multiple newly created data frames inside a function in python?

I am new to pandas, I have a doubt in returning a data frame from a function. I have a function which creates three new data frames based on the parameters given to it, the function has to return only the data frames which are non-empty. How do I do that?
my code:
def df_r(df,colname,t1):
t1_df = pd.DataFrame()
t2_df = pd.DataFrame()
t3_df = pd.DataFrame()
if t1 :
for colname in df:
some code
some code
t1_df = some data
if t2 :
for colname in df:
some code
some code
t2_df = some data
if t3 :
for colname in df:
some code
some code
t3_df = some data
list = [t1_df,t2_df,t3_df]
Now it should return only the t1_df as the parameter was given t1. So I have inserted all three into a list
list = [t1_df,t2_df,t3_df]
how to check if which df is non-empty and return it?
Just check for empty attribute for each DataFrame
eg.
df = pd.DataFrame()
if df.empty:
print("DataFrame is empty")
output:
DataFrame is empty
pd.empty would return True if DataFrame is empty, else it would return False
This would work even if column names are present but are still missing the data.
So to answer specific to your case
list = [t1_df,t2_df,t3_df]
for df in list:
if not df.empty:
return df
assuming your case has only one of the DataFrame non-empty
if t1_df.empty != True:
return t1_df
elif t2_df.empty !=True:
return t2_df
else:
return t2_df

Pandas apply function, receiving KeyError 'Column Name'

My dataset has a column called age and I'm trying to count the null values.
I know it can be easily achieved by doing something like len(df) - df['age'].count(). However, I'm playing around with functions and just like to apply the function to calculate the null count.
Here is what I have:
def age_is_null(df):
age_col = df['age']
null = df[age_col].isnull()
age_null = df[null]
return len(age_null)
count = df.apply(age_is_null)
print (count)
When I do that, I received an error: KeyError: 'age'.
Can someone tells me why I'm getting that error and what should I change in the code to make it work?
You need DataFrame.pipe or pass DataFrame to function here:
#function should be simplify
def age_is_null(df):
return df['age'].isnull().sum()
count = df.pipe(age_is_null)
print (count)
count = age_is_null(df)
print (count)
Error means if use DataFrame.apply then iterate by columns, so it failed if want select column age.
def func(x):
print (x)
df.apply(func)
EDIT: For selecting column use column name:
def age_is_null(df):
age_col = 'age' <- here
null = df[age_col].isnull()
age_null = df[null]
return len(age_null)
Or pass selected column for mask:
def age_is_null(df):
age_col = df['age']
null = age_col.isnull() <- here
age_null = df[null]
return len(age_null)
Instead of making a function, you can Try this
df[df["age"].isnull() == True].shape
You need to pass dataframe df while calling the function age_is_null.That's why age column is not recognised.
count = df.apply(age_is_null(df))

Change Column values in pandas applying another function

I have a data frame in pandas, one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function, that takes that string 'P1Y4M1D' and returns an integer number.
I am wondering how is it possible to change all the column values to parsed values using that function?
def do_process_citation_data(f_path):
global my_ocan
my_ocan = pd.read_csv("citations.csv",
names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
parse_dates=['creation', 'timespan'])
my_ocan = my_ocan.iloc[1:] # to remove the first row iloc - to select data by row numbers
my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
return my_ocan
def parse():
mydict = dict()
mydict2 = dict()
i = 1
r = 1
for x in my_ocan['oci']:
mydict[x] = str(my_ocan['timespan'][i])
i +=1
print(mydict)
for key, value in mydict.items():
is_negative = value.startswith('-')
if is_negative:
date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
else:
date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0,0,0]
daystotal = (year * 365) + (month * 30) + day
if not is_negative:
#mydict2[key] = daystotal
return daystotal
else:
#mydict2[key] = -daystotal
return -daystotal
#print(mydict2)
#return mydict2
Probably I do not even need to change the whole column with new parsed values, the final goal is to write a new function that returns average time of ['timespan'] of docs created in a particular year. Since I need parsed values, I thought it would be easier to change the whole column and manipulate a new data frame.
Also, I am curious what could be a way to apply the parsing function on each ['timespan'] row without modifying a data frame, I can only assume It could be smth like this, but I don't have a full understanding of how to do that:
for x in my_ocan['timespan']:
x = parse(str(my_ocan['timespan'])
How can I get a column with new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by #Dan) should work. You would need to modify only the parse function in order to receive the string as an argument and return the parsed string at the end. Something like this:
import pandas as pd
def parse_postal_code(postal_code):
# Splitting postal code and getting first letters
letters = postal_code.split('_')[0]
return letters
# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})
# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))
# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])

How to set parameter to loop from specific row in excel using python

The code read data from specific column in excel column ( in my case i used columns = 'profile')
The result is in dataframe as below:
profile
0 https://scontent-lga3-1.xx.fbcdn.net/v/t1.0-1/...
1 https://scontent-lga3-1.xx.fbcdn.net/v/t1.0-1/...
2 https://scontent-lga3-1.xx.fbcdn.net/v/t1.0-1/...
So, I try to loop the data in dataframe. My problem is the algorithm includes the header (profile) as well, so it turns error. Below is my work:
results = []
for result in df :
result = CF.face.detect(result)
if result == []:
#do something
else:
#do something
print(results)
The error I got from this code is (invalid as it loop the 'profile' as well):
status_code: 400
code: InvalidURL
code: InvalidURL
message: Invalid image URL.
My question is, how to write the code so it will loop all the data within column (excluding the 'profile')? I am not sure if put 'df' in 'for result in df ' is a correct way or vice versa.
If you want to loop through one column in the the dataframe, you can refer to it and then write your loop
for result in df['profile'] :
result = CF.face.detect(result)
if result == []:
#do something
else:
#do something
print(results)

Categories