Adding up the values with the same index using numpy/python

I am a newbie to Python and NumPy. I want to find the total number of rainfall days, i.e. the sum of column E for each year (image attached herewith).
I am using numpy.unique to find the unique elements of array year.
Following is my attempt:
import numpy as np
data = np.genfromtxt("location/ofthe/file", delimiter = " ")
unique_year = np.unique(data[:,0], return_index=True)
print(unique_year)
j= input('select one of the unique year: >>> ')
#Then I want to give the sum of the rainfall days in that year.
I would appreciate it if someone could help me.
Thanks in advance.

For such tasks, Pandas (which builds on NumPy) is better suited.
Here, you can use GroupBy to create a series mapping, then use your input to query that series:
import pandas as pd
# read file into dataframe
df = pd.read_excel('file.xlsx')
# create series mapping from GroupBy object
rain_days_by_year = df.groupby('year')['Rain days(in numbers)'].sum()
# get input as integer
j = int(input('select one of the unique year: >>> '))
# extract data
res = rain_days_by_year[j]
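If you would rather stay with plain NumPy, as in your attempt, a similar result can be obtained with a boolean mask over the year column. This is only a sketch and assumes the year is in column 0 and the rain-day count (column E) is at index 4 of the loaded array:
import numpy as np
data = np.genfromtxt("location/ofthe/file", delimiter=" ")
years = data[:, 0]
rain_days = data[:, 4]  # column E, assumed to be the fifth column
j = float(input('select one of the unique years: >>> '))
total = rain_days[years == j].sum()  # sum only the rows of the chosen year
print(total)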

Related

ValueError: Cannot take a larger sample than population when 'replace=False' using Groupby pandas [duplicate]

This question was closed as a duplicate of: Select sample random groups after groupby in pandas?
I want to randomly pick, e.g., 10 of the groups I have in a dataframe, but I'm stuck with this error.
What can I do if I want to apply a groupby before the random selection?
I tried the following approaches:
random_selection = tot_groups.groupby('query_col').apply(lambda x: x.sample(3))
random_selection = tot_groups.groupby('query_col').sample(n=10)
Error:
ValueError: Cannot take a larger sample than population when 'replace=False'
Thanks!
UPDATE:
Current dataset
ABG23209.1,UBH04469.1,89.655,145,15,0,1,145,19,163,3.63e-100,275.0
ABG23209.1,UBH04470.1,89.655,145,15,0,1,145,20,164,4.68e-100,275.0
ABG23209.1,UBH04471.1,89.655,145,15,0,1,145,19,163,4.83e-100,275.0
ABG23209.1,UBH04472.1,89.655,145,15,0,1,145,24,168,5.58e-100,275.0
KOX89835.1,SFN69046.1,79.07,86,18,0,1,86,12,97,1.36e-49,143.0
KOX89835.1,SFE98714.1,77.907,86,19,0,1,86,19,104,2.1400000000000002e-49,143.0
KOX89835.1,WP_086938959.1,76.471,85,20,0,1,85,4,88,1.25e-48,140.0
KOX89835.1,WP_231794161.1,76.471,85,20,0,1,85,5,89,1.75e-48,140.0
KOX89835.1,WP_231794169.1,75.294,85,21,0,1,85,5,89,2.41e-48,140.0
WP_001287378.1,QBP98897.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_005164157.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_085071573.1,86.667,135,18,0,1,135,1,135,1.73e-85,241.0
WP_001287378.1,WP_014608965.1,86.765,136,17,1,1,135,1,136,2.49e-85,240.0
WP_001287378.1,WP_004932170.1,86.667,135,18,0,1,135,1,135,6.88e-78,221.0
WP_001287378.1,GGD19357.1,91.912,136,10,1,1,136,1,135,1.01e-77,221.0
WP_001287378.1,OMQ27200.1,85.926,135,19,0,1,135,1,135,1.79e-77,221.0
XP_037955766.1,WP_229689219.1,93.583,374,24,0,3,376,5,378,0.0,745.0
XP_037955766.1,WP_229799179.1,93.583,374,24,0,3,376,1,374,0.0,744.0
XP_037955766.1,WP_017454560.1,92.308,377,28,1,1,376,1,377,0.0,738.0
XP_037955766.1,WP_108127780.1,92.838,377,26,1,1,376,1,377,0.0,736.0
Desired output: randomly select n groups from the dataframe, grouped by query_col, e.g. with n=2:
WP_001287378.1,QBP98897.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_005164157.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_085071573.1,86.667,135,18,0,1,135,1,135,1.73e-85,241.0
WP_001287378.1,WP_014608965.1,86.765,136,17,1,1,135,1,136,2.49e-85,240.0
WP_001287378.1,WP_004932170.1,86.667,135,18,0,1,135,1,135,6.88e-78,221.0
WP_001287378.1,GGD19357.1,91.912,136,10,1,1,136,1,135,1.01e-77,221.0
WP_001287378.1,OMQ27200.1,85.926,135,19,0,1,135,1,135,1.79e-77,221.0
ABG23209.1,UBH04469.1,89.655,145,15,0,1,145,19,163,3.63e-100,275.0
ABG23209.1,UBH04470.1,89.655,145,15,0,1,145,20,164,4.68e-100,275.0
ABG23209.1,UBH04471.1,89.655,145,15,0,1,145,19,163,4.83e-100,275.0
ABG23209.1,UBH04472.1,89.655,145,15,0,1,145,24,168,5.58e-100,275.0
groupby's sample returns n elements from each group. If a group doesn't contain at least n elements, you'll get the error.
To select groups randomly, count how many groups there are, then sample (without replacement) n numbers in the range [0, number of groups), and return those lines whose group number is among the sampled numbers.
import random
import pandas as pd
random.seed(0)
tot_groups = pd.read_csv("data.csv",header=None).rename(columns={0:"query_col"})
grouped = tot_groups.groupby("query_col") # suppose you want to use this
group_selectors = random.sample(range(grouped.ngroups), k=2)
ret_df = tot_groups[grouped.ngroup().isin(group_selectors)]
print(ret_df)
However, there is no need to create any groupby object. You can collect the list of distinct query_col values, sample from it, and return the lines where query_col has one of the selected values:
import random
import pandas as pd
random.seed(0)
tot_groups = pd.read_csv("data.csv",header=None).rename(columns={0:"query_col"})
unique_queries = tot_groups["query_col"].unique().tolist()
selected_queries = random.sample(unique_queries,k=2)
ret_df = tot_groups[tot_groups["query_col"].isin(selected_queries)]
print(ret_df)
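As a side note (not part of the original answer), you can also stay entirely within pandas by sampling the unique query values as a Series; this gives the same whole-group selection:
selected_queries = tot_groups["query_col"].drop_duplicates().sample(n=2, random_state=0)
ret_df = tot_groups[tot_groups["query_col"].isin(selected_queries)]
print(ret_df)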

Parse CSV in 2D Python Object

I am trying to do analysis on a CSV file which looks like this:
timestamp        value
1594512094.39    51
1594512094.74    76
1594512098.07    50.9
1594512099.59    76.80000305
1594512101.76    50.9
I am using pandas to import each column:
dataFrame = pandas.read_csv('iot_telemetry_data.csv')
graphDataHumidity: object = dataFrame.loc[:, "humidity"]
graphTime: object = dataFrame.loc[:, "ts"]
My problem is that I need to combine both columns so that I can filter the values in a specific time range. For example, given a timestampBeginn of 1594512109.13668 and a timestampEnd of 1594512129.37415, I want the corresponding values so I can compute, for example, the mean value over that time range.
I didn't find any solutions to this online, and I don't know of any libraries that solve this problem.
You can first filter the rows whose timestamp values are between the start and the end, then do your calculation on the filtered rows, as follows.
(In the sample data there is no row whose timestamp falls between 1594512109.13668 and 1594512129.37415; edit the range values as needed.)
import pandas as pd
df = pd.read_csv('iot_telemetry_data.csv')
start = 1594512109.13668
end = 1594512129.37415
df = df[(df['timestamp'] >= start) & (df['timestamp'] <= end)]
average = df['value'].mean()
print(average)
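As a small variation (my own sketch, not from the original answer), the same filter can also be written with Series.between, which some find more readable:
mask = df['timestamp'].between(start, end)  # inclusive on both ends by default
average = df.loc[mask, 'value'].mean()
print(average)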

Taking first value in a rolling window that is not numeric

This question follows one I previously asked here, which was answered for numeric values.
I now raise this second one for data of Period type.
While the example given below appears simple, my actual windows are of variable size. Since I am interested in the first row of each window, I am looking for a technique that makes use of this definition.
import pandas as pd
from random import seed, randint
# DataFrame
pi1h = pd.period_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in pi1h]
df = pd.DataFrame({'Values' : values, 'Period' : pi1h}, index=pi1h)
# This works (numeric type)
df['first'] = df['Values'].rolling(3).agg(lambda rows: rows[0])
# This doesn't (Period type)
df['OpeningPeriod'] = df['Period'].rolling(3).agg(lambda rows: rows[0])
Result of the 2nd command:
DataError: No numeric types to aggregate
Please, any idea? Thanks for any help!
The first row of a rolling window of size 3 is the row 2 rows above the current one, so just use pd.Series.shift(2):
df['OpeningPeriod'] = df['Period'].shift(2)
For a variable window size (for the sake of example, I took the Values column as the variable size):
import numpy as np
# offset of the opening row of each window, with the window size taken from the Values column
x = (np.arange(len(df)) - df['Values'])
df['OpeningPeriod'] = np.where(x.ge(0), df.loc[df.index[x.tolist()], 'Period'], np.nan)
Convert your period[H] to a float
# convert to float
df['Period1'] = df['Period'].dt.to_timestamp().values.astype(float)
# rolling and convert back to period
df['OpeningPeriod'] = pd.to_datetime(
    df['Period1'].rolling(3).agg(lambda rows: rows[0])
).dt.to_period('1h')
# drop column
df = df.drop(columns='Period1')
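Another workaround, sketched here as my own suggestion rather than taken from the answers above: roll over integer positions instead of the Period column itself, then use the first position of each complete window to index back into the original column. This keeps the Period dtype untouched and works for any non-numeric column:
import numpy as np
# first integer position of each rolling window of size 3 (NaN for incomplete windows)
first_pos = pd.Series(np.arange(len(df)), index=df.index).rolling(3).agg(lambda rows: rows.iloc[0])
valid = first_pos.notna()
df.loc[valid, 'OpeningPeriod'] = df['Period'].iloc[first_pos[valid].astype(int)].values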

Pandas select rows based on a function of a column

I am trying to learn Pandas. I have found several examples of how to construct a pandas dataframe and how to add columns, and they work nicely. I would now like to learn how to select all rows based on a value of a column. I have found multiple examples of how to perform selection when the value of a column should be smaller or greater than a certain number, and that also works. My question is how to do a more general selection, where I first compute a function of a column and then select all rows for which the value of that function is greater or smaller than a certain number.
import names
import numpy as np
import pandas as pd
from datetime import date
import random
def randomBirthday(startyear, endyear):
    T1 = date.today().replace(day=1, month=1, year=startyear).toordinal()
    T2 = date.today().replace(day=1, month=1, year=endyear).toordinal()
    return date.fromordinal(random.randint(T1, T2))

def age(birthday):
    today = date.today()
    return today.year - birthday.year - ((today.month, today.day) < (birthday.month, birthday.day))
N_PEOPLE = 20
dict_people = { }
dict_people['gender'] = np.array(['male','female'])[np.random.randint(0, 2, N_PEOPLE)]
dict_people['names'] = [names.get_full_name(gender=g) for g in dict_people['gender']]
peopleFrame = pd.DataFrame(dict_people)
# Example 1: Add new columns to the data frame
peopleFrame['birthday'] = [randomBirthday(1920, 2020) for i in range(N_PEOPLE)]
# Example 2: Select all people with a certain age
peopleFrame.loc[age(peopleFrame['birthday']) >= 20]
This code works except for the last line. Please suggest the correct way to write this line. I have considered adding an extra column with the value of the age function and then selecting based on its value; that would work, but I am wondering whether I have to do it. What if I don't want to store the age of a person, only use it for the selection?
Use Series.apply:
peopleFrame.loc[peopleFrame['birthday'].apply(age) >= 20]
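If the frame grows large, a vectorized variant (my own sketch, not part of the answer above) avoids calling the Python-level age function once per row:
today = pd.Timestamp.today()
bday = pd.to_datetime(peopleFrame['birthday'])
# subtract 1 for people whose birthday has not yet occurred this year
not_yet = (bday.dt.month > today.month) | ((bday.dt.month == today.month) & (bday.dt.day > today.day))
ages = today.year - bday.dt.year - not_yet.astype(int)
peopleFrame.loc[ages >= 20]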

Taking a proportion of a dataframe based on column values

I have a Pandas dataframe with ~50,000 rows and I want to randomly select a proportion of rows from that dataframe based on a number of conditions. Specifically, I have a column called 'type of use' and, for each field in that column, I want to select a different proportion of rows.
For instance:
df[df['type of use'] == 'housing'].sample(frac=0.2)
This code returns 20% of all the rows which have 'housing' as their 'type of use'. The problem is I do not know how to do this for the remaining fields in a way that is 'idiomatic'. I also do not know how I could take the result from this sampling to form a new dataframe.
You can make a list of all the unique values in the column with list(df['type of use'].unique()) and iterate over it as below:
for i in list(df['type of use'].unique()):
    print(df[df['type of use'] == i].sample(frac=0.2))
or
i = 0
while i < len(list(df['type of use'].unique())):
    df1 = df[(df['type of use'] == list(df['type of use'].unique())[i])].sample(frac=0.2)
    print(df1.head())
    i = i + 1
For storing the results you can create a dictionary:
dfs = ['df' + str(x) for x in list(df['type of use'].unique())]
dicdf = dict()
i = 0
while i < len(dfs):
    dicdf[dfs[i]] = df[(df['type of use'] == list(df['type of use'].unique())[i])].sample(frac=0.2)
    i = i + 1
print(dicdf)
This will print a dictionary of the dataframes. You can then print whichever one you would like to see, for example the housing sample: print(dicdf['dfhousing'])
Sorry this is coming in 2+ years late, but I think you can do this without iterating, based on help I received on a similar question here. Applying it to your data:
import pandas as pd
import math
percentage_to_flag = 0.2 #I'm assuming you want the same %age for all 'types of use'?
#First, create a new 'helper' dataframe:
random_state = 41 # Change to get different random values.
df_sample = df.groupby("type of use").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
df_sample = df_sample.reset_index(level=0, drop=True) #may need this to simplify multi-index dataframe
# Now, mark the random sample in a new column in the original dataframe:
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
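As a footnote (my own addition, not in the original answers): on pandas 1.1 and later, GroupBy.sample accepts frac directly, so the helper frame can be built in one line:
df_sample = df.groupby("type of use").sample(frac=percentage_to_flag, random_state=random_state)
df["marked"] = df.index.isin(df_sample.index)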
