An example of my df:
Index time type pwa0 pwa1 pwa2
63 16:05:03 nonJ [20:733:845] [] [2750]
I would like to split the columns having no, one or multiple values (pwa0, pwa1 and pwa2) like this:
Index time type pwa0 pwa1 pwa2
63 16:05:03 nonJ 20 2750
63 16:05:03 nonJ 733
63 16:05:03 nonJ 845
In contrast to the possible shown duplicate, the to be splitted columns are not correlated. The split order per column should just be order based: first the first value, then the second etc. If no values are present in a column, the cell should remain empty. Any suggestions will be highly appreciated!
Related
I have a datafrme df1 like below: lat-long can be duplicate
miles uid lat_long
12 235 (45,67)
13 234 (41.09,67)
14 233 (34,55)
15 236 (12.23,65.78)
16 239 (27,34)
I want remove the entry from df1 if the lat_long value is invalid.I am doing this like below but taking too much time.
all_lat_long = df1["lat_long"].tolist(). #list of tuples
def lat_long_check(each_coordnts):
match = re.match('^\((?P<lat>-?\d*(.\d+)),(?P<long>-?\d*(.\d+))\)$',
str(each_coordnts)) #find invalid lat-long
if match is None:
idx = df1[df1['lat_long'] == each_coordnts].index
df1.drop(idx,inplace=True)
for each_coordnts in all_lat_long:
lat_long_check(each_coordnts)
Is there any efficient way to do this for 1M records? Once wrong lat-long entries are removed, I want populate two new columns at the end of df1-"Latitude" and "Longitude" and populate corresponding values.
I would proceed as follows:
Define a function validate_lat_long that returns a tuple of floats if the latitude/longitude values are correct. I assume this has to do with checking that the values are within expected intervals (-90 to 90 for latitude, etc). The function should return np.nan if the values are not correct.
Create a new column with correct values as follows:
df1["validated_lat_long"] = df1["lat_long"].apply(validate_lat_long)
Finally, in order to remove invalid values, use dropna on the new column and possibly make a new dataframe if you need to preserve the previous work:
new_df = df1.dropna(subset=["validated_lat_long"])
Your code is most probably slow because it iterates on dataframe rows. Applying a function with df.apply() should speed things up reasonably. Also I hope you can check floats instead of searching for a regex.
I have a dataframe that contains three series called Date, Element,
and Data_Value--their types are string, string, and numpy.int64
respectively. Date has dates in the form of yyyy-mm-dd; Element has
strings that say either TMIN or TMAX, and it denotes whether the
Data_Value is the minimum or maximum temperature of a particular date;
lastly, the Data_Value series just represents the actual temperature.
The date series has multiple duplicates of the same date. E.g. for the
date 2005-01-01, there are 19 entries for the temperature column, the
values start at 28 and go all the way up to 156. I want to create a
new dataframe with the date and the maximum temperature only--I'll
eventually want one for TMIN values too, but I figure that if I can do
one I can figure out the other. I'll post some psuedocode with
explanation below to show what I've tried so far.
So far I have pulled in the csv and assigned it to a variable, df.
Then I sorted the values by Date, Element and Temperature
(Data_Value). After that, I created a variable called tmax that grabs
the necessary dates (I only need the data from 2005-2014) that have
'TMAX' as its Element value. I cast tmax into a new DataFrame, reset
its index to get rid of the useless index data from the first
dataframe, and dropped the 'Element' column since it was redundant at
this point. Now I'm (ultimately) trying to create a list of all the
Temperatures for TMAX so that I can plot it with pyplot. But I can't
figure out for the life of me how to reduce the dataframe to just the
single date and max value for that date. If I could just get that then
I could easily convert the series to a list and plot it.
def record_high_and_low_temperatures():
#read in csv
df = pd.read_csv('somedata.csv')
#sort values so they're in a nice order
df.sort_values(by=['Date', 'Element', 'Data_Value'], inplace=True)
# grab all entries for TMAX in correct date range
tmax = df[(df['Element'] == 'TMAX') & (df['Date'].between("2005-01-01", "2014-12-31"))]
# cast to dataframe
tmax = pd.DataFrame(tmax, columns=['Date', 'Data_Value'])
# Remove index column from previous dataframe
tmax.reset_index(drop=True, inplace=True)
# this is where I'm stuck, how do I get the max value per unique date?
max_temp_by_date = tmax.loc[tmax['Data_Value'].idxmax()]
Any and all help is appreciated, let me know if I need to clarify anything.
TL;DR:
Ok...
input dataframe looks like
date | data_value
2005-01-01 28
2005-01-01 33
2005-01-01 33
2005-01-01 44
2005-01-01 56
2005-01-02 0
2005-01-02 12
2005-01-02 30
2005-01-02 28
2005-01-02 22
Expected df should look like:
date | data_value
2005-01-01 79
2005-01-02 90
2005-01-03 88
2005-01-04 44
2005-01-05 63
I just want a dataframe that has each unique date coupled with the highest temperature on that day.
If I understand you correctly, what you would want to do is as Grzegorz already suggested in the comments, is to groupby date (take all elements of one date) and then take the maximum of that date:
df.groupby('date').max()
This will take all your groups and reduce them to only one row, taking the maximum element of every group. In this case, max() is called the aggregation function of the group. As you mentioned that you will also need the minimum at some point, a nice way to do this (instead of two groupbys) is to do the following:
df.groupby('date').agg(['max', 'min'])
which will pass over all groups once and apply both aggregation functions max and min returning two columns for each input column. More documentation on aggregation is here.
Try this:
df.groupby("Date")['data_value'].max()
I have a pandas dataframe that looks like:
cleanText.head()
source word count
0 twain_ess 988
1 twain_ess works 139
2 twain_ess short 139
3 twain_ess complete 139
4 twain_ess would 98
5 twain_ess push 94
And a dictionary that contains the total word count for each source:
titles
{'orw_ess': 1729, 'orw_novel': 15534, 'twain_ess': 7680, 'twain_novel': 60004}
My goal is to normalize the word counts for each source by the total number of words in that source, i.e. turn them into a percentage. This seems like it should be trivial but python seems to make it very difficult (if anyone could explain the rules for inplace operations to me that would be great).
The caveat comes from needing to filter the entries in cleanText to just those from a single source, and then I attempt to inplace divide the counts for this subset by the value in the dictionary.
# Adjust total word counts and normalize
for key, value in titles.items():
# This corrects the total words for overcounting the '' entries
overcounted= cleanText[cleanText.iloc[:,0]== key].iloc[0,2]
titles[key]= titles[key]-overcounted
# This is where I divide by total words, however it does not save inplace, or at all for that matter
cleanText[cleanText.iloc[:,0]== key].iloc[:,2]= cleanText[cleanText.iloc[:,0]== key]['count']/titles[key]
If anyone could explain how to alter this division statement so that the output is actually saved in the original column that would be great.
Thanks
If I understand Correctly:
cleanText['count']/cleanText['source'].map(titles)
Which gives you:
0 0.128646
1 0.018099
2 0.018099
3 0.018099
4 0.012760
5 0.012240
dtype: float64
To re-assign these percentage values into your count column, use:
cleanText['count'] = cleanText['count']/cleanText['source'].map(titles)
My goal is that by given a value on a row (let's say 3), look for the value of a given column 3 rows below. Currently I am perfoming this using for loops but it is tremendously inefficient.
I have read that vectorizing can help to solve this problem but I am not sure how.
My data is like this:
Date DaysToReception Quantity QuantityAtTheEnd
20/03 3 102
21/03 - 88
22/03 - 57
23/03 5 178
24/03
And I want to obtain:
Date DaysToReception Quantity QuantityAtReception
20/03 3 102 178
21/03 - 88
22/03 - 57
23/03 5 178
24/03
...
Thanks for your help!
If you have unique date or DaysToReception, you can actually use Map/HashMap where the key will be the date or DaysToReception and the values will be other information which you can store using a list or any other appropriate data structure.
This will definitely improve the efficiency.
As you pointed out that "number of rows I search the value below depends on the value "DaysToReception", I believe "DaysToReception" will not be unique. In that case, the key to your Map will be the date.
The easiest way I can think of to do this in pandas is the following:
# something like your dataframe
df = pd.DataFrame(dict(date=['20/03', '21/03', '22/03', '23/03'],
days=[3, None, None, 5,],
quant=[102, 88, 57, 178]))
# get the indexs of all days that aren't missing
idxs = df.index[~pd.isnull(df.days)]
# get number of days to go
values = df.days[idxs].values.astype(int)
# get index of three days ahead
new_idxs = idxs+values
# create a blank column
df['quant_end'] = None
# Now fill it with the data we're after
df.quant_end[idxs] = df.quant[new_idxs]
I have a pandas dataframe with a two-element hierarchical index ("month" and "item_id"). Each row represents a particular item at a particular month, and has columns for several numeric measures of interest. The specifics are irrelevant, so we'll just say we have column X for our purposes here.
My problem stems from the fact that items vary in the months for which they have observations, which may or may not be contiguous. I need to calculate the average of X, across all items, for the 1st, 2nd, ..., n-th month in which there is an observation for that item.
In other words, the first row in my result should be the average across all items of the first row in the dataframe for each item, the second result row should be the average across all items of the second observation for that item, and so on.
Stated another way, if we were to take all the date-ordered rows for each item and index them from i=1,2,...,n, I need the average across all items of the values of rows 1,2,...,n. That is, I want the average of the first observation for each item across all items, the average of the second observation across all items, and so on.
How can I best accomplish this? I can't use the existing date index, so do I need to add another index to the dataframe (something like I describe in the previous paragraph), or is my only recourse to iterate across the rows for each item and keep a running average? This would work, but is not leveraging the power of pandas whatsoever.
Adding some example data:
item_id date X DUMMY_ROWS
20 2010-11-01 16759 0
2010-12-01 16961 1
2011-01-01 17126 2
2011-02-01 17255 3
2011-03-01 17400 4
2011-04-01 17551 5
21 2007-09-01 4 6
2007-10-01 5 7
2007-11-01 6 8
2007-12-01 10 9
22 2006-05-01 10 10
2006-07-01 13 11
23 2006-05-01 2 12
24 2008-01-01 2 13
2008-02-01 9 14
2008-03-01 18 15
2008-04-01 19 16
2008-05-01 23 17
2008-06-01 32 18
I've added a dummy rows column that does not exist in the data for explanatory purposes. The operation I'm describing would effectively give the mean of rows 0,6,10,12, and 13 (the first observation for each item), then the mean of rows 1,7,11,and 15 (the second observation for each item, excluding item 23 because it has only one observation), and so on.
One option is to reset the index then group by id.
df_new = df.reset_index()
df_new.groupby(['item_id']).X.agg(np.mean)
this leaves your original df intact and gets you the mean across all months for each item id.
For your updated question (great example by the way) I think the approach would be to add an "item_sequence_id" I've done this in the path with similar data.
df.sort(['item_id', 'date'], inplace = True)
def sequence_id(item):
item['seq_id'] = range(0,len(item)-1,1)
return item
df_with_seq_id = df.groupby(['item_id']).apply(sequence_id)
df_with_seq_id.groupby(['seq_id']).agg(np.mean)
The idea here is that the seq_id allows you to identify the position of the data point in time per item_id assigning non-unique seq_id values to the items will allow you to group across multiple items. The context I've used this in before relates to users doing something first in a session. Using this ID structure I can identify all of the first, second, third, etc... actions taken by users regardless of their absolute time and user id.
Hopefully this is more of what you want.
Here's an alternative method for this I finally figured out (which assumes we don't care about the actual dates for the purposes of calculating the mean). Recall the method proposed by #cwharland:
def sequence_id(item):
item['seq'] = range(0,len(item),1)
return item
shrinkWithSeqID_old = df.groupby(level='item_id').apply(sequence_id)
Testing this on a 10,000 row subset of the data frame:
%timeit -n10 dfWithSeqID_old = shrink.groupby(level='item_id').apply(sequence_id)
10 loops, best of 3: 301 ms per loop
It turns out we can simplify things by remembering that pandas' default behavior (i.e. without specifying an index column) is to generate a numeric index for a dataframe numbered from 0 to n (the number of rows in the frame). We can leverage this like so:
dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
The only difference in the output is that we have a new, unlabeled numeric index with the same content as the 'seq' column used in the previous answer, BUT it's almost 4 times faster (I can't compare the methods for the full 13 million row dataframe, as the first methods was resulting in memory errors):
%timeit -n10 dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
10 loops, best of 3: 77.2 ms per loop
Calculating the average as in my original question is only slightly different. The original method was:
dfWithSeqID_old.groupby('seq').agg(np.mean).head()
But now we simply have to account for the fact that we're using the new unlabeled index instead of the 'seq' column:
dfWithSeqID_new.mean(level=1).head()
The result is the same.