I have a DataFrame with 4 columns:
DateTime | WindSpeed1 | WindSpeed2 | Direction
I have created a for loop where I take the DateTime, WindSpeed1, and WindSpeed2 columns iteratively and drop the rows with wind speed below 3 m/s. My question is:
How can I access the results of each iteration of the for loop in order to merge them on the common DateTime index?
I am using Python 2.7 by the way.
Thank you very much in advance.
I am sorry for not being clear enough. What I have mentioned above is actually for a GUI that I am programming.
As you can see below, I take the data frame and delete wind speeds below a user-given value (result[4]). But since "v_range2" differs from iteration to iteration, I would like to find a way to merge every iteration on the common DateTime.
Do you have an idea how I can do that?
for i in range(0, len(dirbins)):  # iterates on sectors
    meansd = []  # mean ws by sector
    heightsd = []
    for j in result[0]:  # iterates on ws
        if j != "":
            v = m.datafr(df).df[j]
            v_range1 = v >= result[4]  # boolean mask: ws at or above the user threshold
            v_range2 = v[v_range1]  # keep only those rows
            vidx = m.chnames().index(j)
            heightsd.append(float(str(m.channels[vidx].height)))
            meansd.append(v_range2[digitized_d == i].mean())  # mean ws by sector
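One hedged way to do the merge (a sketch, assuming the DataFrame is indexed by DateTime so each filtered Series keeps its timestamps, and that m.datafr(df).df[j] returns a Series):

import pandas as pd

filtered = {}  # one filtered Series per wind-speed channel
for j in result[0]:
    if j != "":
        v = m.datafr(df).df[j]
        filtered[j] = v[v >= result[4]]  # keeps the DateTime index

# outer-join the Series on their common DateTime index;
# timestamps missing from one channel become NaN
merged = pd.concat(filtered, axis=1)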
Related
I have a pandas DataFrame with more than 100 thousand rows. The index represents the time, and two columns represent the sensor data and the condition.
When the condition becomes 1, I want to start calculating a score card (average and standard deviation) until the next 1 comes. This needs to be calculated for the whole dataset.
Here is a picture of the DataFrame for a specific time span:
What I thought of is to iterate through the index and items of the df, and when the condition is met, start calculating the descriptive statistics.
cycle = 0
for i, row in df_b.iterrows():
    if row['condition'] == 1:
        print('Condition is changed')
        cycle += 1
        print('cycle: ', cycle)
        #start = ?
        #end = ?
        #df_b.loc[start:end]
I am not sure how to calculate start and end for this DataFrame. The end will be the start of the next cycle. Additionally, I think this iteration is not optimal because it takes quite a long time. I would appreciate any idea or solution for this problem.
Maybe start out with getting the rows where condition == 1:
cond_1_df = df.loc[df['condition'] == 1]
This dataframe will only contain the rows that meet your condition (being 1).
From here on, you can access the timestamps pairwise, meaning that the first element is beginning and second element is end, sketched below:
former = 0
stamp_pairs = []
df = cond_1_df.reset_index()  # make sure indexes pair with number of rows
for index, row in df.iterrows():
    if former != 0:
        beginning = former
        end = row["timestamp"]
        former = row["timestamp"]
    else:
        # first row: no previous stamp yet
        beginning = 0
        end = row["timestamp"]
        former = row["timestamp"]
    stamp_pairs.append([beginning, end])
This should give you something like this:
[[stamp0, stamp1], [stamp1,stamp2], [stamp2, stamp3]...]
for each of these pairs, you can again create a df containing only the subset of rows where stamp_x < timestamp < stamp_x1 (writing stamp_x1 for the second element of the pair):
time_cond_df = df.loc[(df['timestamp'] > stamp_x) & (df['timestamp'] < stamp_x1)]
Finally, you get one time_cond_df per timestamp tuple, on which you can perform your score calculations.
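A minimal sketch of that last step, assuming your full frame df_b also has a 'timestamp' column (e.g. after reset_index()) and the sensor column is named 'sensor' (adjust to your actual names):

scores = []
for stamp_x, stamp_x1 in stamp_pairs:
    # use the full frame df_b here, since df was reassigned to the condition rows above
    seg = df_b.loc[(df_b['timestamp'] > stamp_x) & (df_b['timestamp'] < stamp_x1)]
    scores.append({'start': stamp_x,
                   'end': stamp_x1,
                   'mean': seg['sensor'].mean(),
                   'std': seg['sensor'].std()})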
Just make sure that your timestamps are comparable with the operators ">" and "<"! We can't tell, since you did not explain how you produced the timestamps.
The probe of an instrument cycles back and forth along the x direction while recording its position and acquiring measurements. The probe makes 10 cycles, let's say from 0 to 10 um (there and back), and records the measurements. This gives 2 columns of data: position and measurement, where the position numbers cycle 0um -> 10um -> 0 -> 10 -> 0..., but these numbers have an experimental error, so they are all different.
I need to split the dataframe at the beginning of each cycle. Any interesting strategy to tackle this problem? Please let me know if you need more info. Thanks in advance.
Below is link to an example of the dataframe that I have.
https://www.dropbox.com/s/af4r8lw5lfhwexr/Example.xlsx?dl=0
In this example the instrument made 3 cycles and generated the data (measurement). Cycle 1 = Index 0-20; Cycle 2 = Index 20-40; and Cycle 3 = Index 40-60. I need to divide this dataframe into 3 dataframes, one for each cycle (Index 0-20; Index 20-40; Index 40-60).
The tricky part is that the method needs to be "general", because each cycle can have a different number of data points (in this example it is fixed at 20), and different experiments can be performed with a different number of cycles.
My objective is to keep track of when the numbers start increasing again after decreasing, in order to determine the cycle number. Not very elegant, sorry.
import pandas as pd

df = pd.read_excel('Example.xlsx')

def cycle(array):
    increasing = 1
    cycle_num = 0
    answer = []
    for ind, val in enumerate(array):
        try:
            if array[ind + 1] - array[ind] >= 0:  # non-decreasing step
                if increasing == 0:
                    # turned from decreasing to increasing: a new cycle starts
                    cycle_num += 1
                increasing = 1
                answer.append(cycle_num)
            else:
                answer.append(cycle_num)
                increasing = 0
        except IndexError:  # last element has no successor
            answer.append(cycle_num)
    return answer

df['Cycle'] = cycle(df['Distance'].to_list())
grouped = df.groupby(['Cycle'])
print(grouped.get_group(0))
print(grouped.get_group(1))
print(grouped.get_group(2))
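A more compact alternative sketch using shift/cumsum instead of the explicit loop (same idea: a new cycle starts where a non-decreasing step follows a decreasing one):

fd = df['Distance'].shift(-1) - df['Distance']  # forward difference
rising = fd >= 0  # non-decreasing step (NaN on the last row counts as False)
prev_rising = rising.shift(1).fillna(True).astype(bool)  # treat the start as already rising
# increment wherever rising turns True after being False
df['Cycle'] = (rising & ~prev_rising).cumsum()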
I'm working in Python, and I need to create a Journey ID and a Journey number. See picture for illustration. The ID should increase every time the previous row of the column "Purpose" takes the value 1. The Journey number does the same, but within each Respondent ID.
GezPerVer2019['JourneyID'] = np.where(GezPerVer2019['Hoofddoel'] = 1, GezPerVer2019['JourneyID'][i+1] + 1, GezPerVer2019['JourneyID'][i-1])
Is what I've tried. Obviously, I'm not yet too skilled at this and I think the problem is that np.where doesn't allow the [i] indicators.
Any help will be greatly appreciated.
Use boolean indexing and cumsum here instead:
m = df['Purpose'] == 1
df.loc[m, 'JourneyID'] = m.cumsum()
Note: = is for assignment, == is for comparison. You want the latter here when comparing with 1.
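For the journey number within each Respondent ID, a hedged sketch along the same lines (the column names 'Purpose' and 'RespondentID' are assumptions on my part; the counter increments on the row after each Purpose == 1, restarting per respondent):

# 1 on rows whose previous row (within the same respondent) had Purpose == 1
prev_is_one = (df['Purpose'] == 1).astype(int).groupby(df['RespondentID']).shift(1).fillna(0)
# cumulative count per respondent gives the journey number
df['JourneyNumber'] = prev_is_one.groupby(df['RespondentID']).cumsum()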
I am trying to run a simple count function which runs a dataframe of event times (specifically surgeries) against another dataframe of shift time frames, and returns a list of how many events occur during each shift. These csvs are thousands of rows, though, so while the way I have it set up currently works, it takes forever. This is what I have:
numSurgeries = [0 for shift in range(len(df.Date))]
for i in range(len(OR['PATIENT_IN_ROOM_DTTM'])):
    for shift in range(len(df.DateTime)):
        if OR['PATIENT_IN_ROOM_DTTM'][i] >= df.DateTime[shift] and OR['PATIENT_IN_ROOM_DTTM'][i] < df.DateTime[shift+1]:
            numSurgeries[shift] += 1
So it loops through each event and checks to see which shift time frame it is in, then increments the count for that time frame. Logical, works, but definitely not efficient.
EDIT:
Example of OR data file
Example of df data file
Without example data, it's not absolutely clear what you want. But this should help you vectorise:
numSurgeries = {shift: np.sum((OR['PATIENT_IN_ROOM_DTTM'] >= df.DateTime[shift]) &
                              (OR['PATIENT_IN_ROOM_DTTM'] < df.DateTime[shift+1]))
                for shift in range(len(df.Date))}
The output is a dictionary mapping integer shift to numSurgeries.
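If the shift boundaries in df.DateTime are sorted and both columns are datetime64 (assumptions on my part; check before using this), a further speed-up sketch with np.searchsorted bins every event in one pass instead of scanning all shifts per event:

import numpy as np

# index of the shift interval each event falls into
idx = np.searchsorted(df.DateTime.values, OR['PATIENT_IN_ROOM_DTTM'].values, side='right') - 1
# drop events outside the first/last boundary, then count per shift
valid = (idx >= 0) & (idx < len(df.DateTime) - 1)
counts = np.bincount(idx[valid], minlength=len(df.DateTime) - 1)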
As mentioned above, it is hard to answer without example data.
However, a boolean mask sounds fitting. See Select dataframe rows between two dates.
Create a date mask from the shift; we'll call the start and end dates start_shift and end_shift, respectively. These should be in datetime format.
date_mask = (df['datetime'] >= start_shift) & (df['datetime'] <= end_shift)
Locate all values in df that fit this range.
df_shift = df.loc[date_mask]
Count the number of instances in the new df_shift.
num_surgeries = len(df_shift.index)
Cycle through all shifts.
# assumes each row of results_df carries that shift's start_shift and end_shift
def count_shifts(row, df):
    date_mask = (df['datetime'] >= row['start_shift']) & (df['datetime'] <= row['end_shift'])
    df_shift = df.loc[date_mask]
    num_surgeries = len(df_shift.index)
    return num_surgeries

# iterates through results_df and applies the above function to every row
results_df['num_surgeries'] = results_df.apply(count_shifts, axis=1, df=df)
Also remember to name variables according to the PEP 8 style guide! camelCase is not recommended in Python.
I'm parsing a big CSV file using csv.DictReader.
quotes = open("file.csv", "rb")
csvReader = csv.DictReader(quotes)
Then for each row I'm converting the time value in the CSV to a datetime using this:
for data in csvReader:
    year = int(data["Date"].split("-")[2])
    month = strptime(data["Date"].split("-")[1], '%b').tm_mon
    day = int(data["Date"].split("-")[0])
    hour = int(data["Time"].split(":")[0])
    minute = int(data["Time"].split(":")[1])
    bars = datetime.datetime(year, month, day, hour, minute)
Now I would like to perform actions only on the rows of the same day. Would it be possible to do it in the same for loop, or should I rather save the data out per day and then perform the actions? What would be an efficient way of handling the parsing?
As jogojapan has pointed out, it is important to know whether we can assume that the CSV file is sorted by date. If it is, then you could use itertools.groupby to simplify your code. For example, the for loop in this code iterates over the data one day at a time:
import csv
import datetime
import itertools

with open("file.csv", "rb") as quotes:
    csvReader = csv.DictReader(quotes)
    lmb = lambda d: datetime.datetime.strptime(d["Date"], "%d-%b-%Y").date()
    for k, g in itertools.groupby(csvReader, key=lmb):
        # do stuff per day
        counts = (int(data["Count"]) for data in g)
        print "On {0} the total count was {1}".format(k, sum(counts))
I created a test "file.csv" containing the following data:
Date,Time,Count
1-Apr-2012,13:23,10
2-Apr-2012,10:57,5
2-Apr-2012,11:38,23
2-Apr-2012,15:10,1
3-Apr-2012,17:47,123
3-Apr-2012,18:21,8
and when I ran the above code I got the following results:
On 2012-04-01 the total count was 10
On 2012-04-02 the total count was 29
On 2012-04-03 the total count was 131
But remember that this will only work if the data in "file.csv" is sorted by date.
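If you cannot be sure the file is sorted, one hedged workaround (assuming the data fits in memory) is to read all rows first, sort them by date, and group the sorted list instead, reusing the same lmb key function inside the with block:

    rows = sorted(csvReader, key=lmb)  # read everything, then sort by date
    for k, g in itertools.groupby(rows, key=lmb):
        counts = (int(data["Count"]) for data in g)
        print "On {0} the total count was {1}".format(k, sum(counts))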
If (for some reason) you can assume that the input rows are already sorted by date, you could put them into a local container one by one as long as the date of any new row is the same as the previous one:
same_date_rows = []
prev_date = None
for data in csvReader:
    # ... your existing code
    bars = datetime.datetime(year, month, day, hour, minute)
    if bars.date() == prev_date:  # compare dates only, not times
        same_date_rows.append(data)
    else:
        # New date. We process all rows collected so far
        do_something(same_date_rows)
        # Then we start a new collection for the new date
        same_date_rows = [data]
        # Remember the date of the current row
        prev_date = bars.date()
# Finally, process the final group of rows
do_something(same_date_rows)
But if you cannot make that assumption, you will have to
Either: Put the rows in a long list, sort that by date, and then apply an algorithm like the above to the sorted list
Or: Put the rows in a dictionary, using the date as key, and a list of rows as value for each key. Then you can iterate through the keys of that dictionary to get access to all rows that share a date.
The second of these two approaches is a little more space-consuming, but it may allow you to do some of the date-specific processing in the main loop, because whenever you receive a new row for an already-existing date, you could apply some of the date-specific processing right away, possibly avoiding the need to actually store all date-specific rows explicitly. Whether that is possible depends on what kind of processing you apply to the rows.
If you are not going for space efficiency, an elegant solution would be to create a dictionary where the key is your day and the value is a list object in which all the information for each day is stored. Later you can do whatever operations you want on a per-day basis.
For example
d = {}  # initialize an empty dictionary
for data in csvReader:
    day = int(data["Date"].split("-")[0])  # note: day of month only; days from different months collide
    try:
        d[day].append('Some_Val')
    except KeyError:
        d[day] = ['Some_Val']
This will either modify or create a new list object for each day. This is later easily accessible either by iterating over the dictionary or simply referring to the day as a key.
For example:
d[Some_Day]
will simply give you a list object with all the information you have stored. Given the constant-time lookup of a dictionary, it should be quite efficient in terms of time.
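The same pattern can also be written with collections.defaultdict, which removes the need for the try/except (a sketch, with the same placeholder 'Some_Val' payload as above):

from collections import defaultdict

d = defaultdict(list)  # missing keys start out as empty lists
for data in csvReader:
    day = int(data["Date"].split("-")[0])
    d[day].append('Some_Val')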