I currently have data in the following format in a dataframe:
metric__name sample sample_date
0 ga:visitBounceRate 100 2012-11-13
1 ga:uniquePageviews 20 2012-11-13
2 ga:newVisits 19 2012-11-13
3 ga:visits 20 2012-11-13
4 ga:percentNewVisits 95 2012-11-13
5 ga:pageviewsPerVisit 1 2012-11-13
6 ga:pageviews 20 2012-11-13
7 ga:visitBounceRate 72 2012-11-14
8 ga:uniquePageviews 63 2012-11-14
9 ga:newVisits 39 2012-11-14
I am trying to break the metric__name column out into separate columns, like this:
ga:visitBounceRate ga:uniquePageviews ga:newVisits etc...
sample_date
2012-11-13 100 20 19 etc...
I am doing the following to get my desired result.
df.pivot(index='sample_date', columns='metric__name', values='sample')
All I keep getting is an error that the index contains duplicate entries, which it indeed does, but why wouldn't it understand that those rows belong together and map them to the same line, as in my desired output?
Use pivot_table (which doesn't throw this exception):
In [11]: df.pivot_table('sample', 'sample_date', 'metric__name')
Out[11]:
metric__name ga:newVisits ga:pageviews ga:pageviewsPerVisit ga:percentNewVisits ga:uniquePageviews ga:visitBounceRate ga:visits
sample_date
2012-11-13 19 20 1 95 20 100 20
2012-11-14 39 NaN NaN NaN 63 72 NaN
It accepts an aggregation function (the default is mean):
aggfunc : function, default numpy.mean, or list of functions
If list of functions passed, the resulting pivot table will have hierarchical columns
whose top level are the function names (inferred from the function objects themselves)
Regarding the difference between the two, I think pivot just does reshaping (and throws an error if there is a problem), whereas pivot_table offers more advanced functionality, aka "spreadsheet-style pivot tables".
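As a quick sketch of that difference (toy data and column names of my own, not from the question):
import pandas as pd
df = pd.DataFrame({'idx': [1, 1, 2], 'col': ['a', 'a', 'b'], 'val': [10, 20, 30]})
# pivot refuses the duplicate (idx, col) pair:
try:
    df.pivot(index='idx', columns='col', values='val')
except ValueError as e:
    print(e)  # Index contains duplicate entries, cannot reshape
# pivot_table aggregates the duplicates instead (mean by default):
print(df.pivot_table('val', 'idx', 'col'))
# col     a     b
# idx
# 1    15.0   NaN
# 2     NaN  30.0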
I'm working on a large dataset with more than 60K rows.
I have a continuous measurement of current in one column. Each code is measured for about a second, during which the equipment takes 14/15/16/17 readings depending on its speed; then the measurement moves on to the next code and again takes 14/15/16/17 readings, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current measurement.
The top 48 rows of the data are as follows:
Index  Curr(mA)
0      1.362476
1      1.341721
2      1.362477
3      1.362477
4      1.355560
5      1.348642
6      1.327886
7      1.341721
8      1.334804
9      1.334804
10     1.348641
11     1.362474
12     1.348644
13     1.355558
14     1.334805
15     1.362477
16     1.556172
17     1.542336
18     1.549252
19     1.528503
20     1.549254
21     1.528501
22     1.556173
23     1.556172
24     1.542334
25     1.556172
26     1.542336
27     1.542334
28     1.556170
29     1.535415
30     1.542334
31     1.729109
32     1.749863
33     1.749861
34     1.749861
35     1.736024
36     1.770619
37     1.742946
38     1.763699
39     1.749861
40     1.749861
41     1.763703
42     1.756781
43     1.742946
44     1.736026
45     1.756781
46     1.964308
47     1.957395
I want to write a script that averages each run of 14/15/16/17 readings into a single value per code measurement, in a separate column. I have been thinking of doing this with pandas.
I want the data to look like this:
Index  Curr(mA)
0      1.34907
1      1.54556
2      1.74986
Need some help to get this done. Please help.
First get the indexes of every row where there's a jump. Use Pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check to see if it's greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (in the case of your sample data, three) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or if it's really just Curr(mA) and the index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes based on the indices you just pulled. Then you can go ahead and average them in a list comprehension.
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (the same values, but as a dataframe rather than a list), convert the list to a pd.Series and reset_index():
pd.Series(
    [df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index()
index 0
0 0 1.349073
1 1 1.545564
2 2 1.749863
3 3 1.960851
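For the other case, where the source dataframe has more columns you want to keep, one possibility (my addition, not part of the original answer, assuming the same df and the 0.15 jump threshold from above) is to build a run label from the same jump mask and group on it:
# Each True in the jump mask bumps the cumulative sum by one, so all
# rows between two jumps share the same integer run label.
runs = (df['Curr(mA)'].diff() > 0.15).cumsum().rename('run')
# Average every numeric column within each run; unlike np.split, this
# keeps working no matter how many columns df has.
averaged = df.groupby(runs).mean().reset_index(drop=True)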
I have a data frame which looks like this:
student_id  session_id  reading_level_id  st_week  end_week
1           3334        3                 3        3
1           3335        2                 4        4
2           3335        2                 2        2
2           3336        2                 2        3
2           3337        2                 3        3
2           3339        2                 3        4
...
There are multiple session_ids, st_weeks and end_weeks for every student_id. I'm trying to group the data by student_id, and I want to calculate the difference between the maximum end_week and the minimum st_week for each student.
Aiming for an output that would look something like this:
Student_id  Diff
1           1
2           2
...
I am relatively new to Python as well as Stack Overflow and have been trying to find an appropriate solution - any help is appreciated.
Using the data you shared, a simpler solution is possible:
Group by student_id, and pass False to the as_index parameter (this keeps the result as a dataframe);
Next, use a named aggregation to get the max of end_week and the min of st_week for each group;
Get the difference between max_wk and min_wk;
Finally, keep only the required columns.
(
df.groupby("student_id", as_index=False)
.agg(max_wk=("end_week", "max"), min_wk=("st_week", "min"))
.assign(Diff=lambda x: x["max_wk"] - x["min_wk"])
.loc[:, ["student_id", "Diff"]]
)
student_id Diff
0 1 1
1 2 2
There's probably a more efficient way to do this, but I broke this into separate steps for the grouping to get max and min values for each id, and then created a new column representing the difference. I used numpy's randint() function in this example because I didn't have access to a sample dataframe.
import pandas as pd
import numpy as np
# generate dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(1200, 4)), columns=['student_id', 'session_id', 'st_week', 'end_week'])
# use groupby to get max and min for each student_id
max_vals = df.groupby(['student_id'], sort=False)['end_week'].max().to_frame()
min_vals = df.groupby(['student_id'], sort=False)['st_week'].min().to_frame()
# use join to put max and min back together in one dataframe
merged = min_vals.join(max_vals)
# use assign() to calculate difference as new column
merged = merged.assign(difference=lambda x: x.end_week - x.st_week).reset_index()
merged
student_id st_week end_week difference
0 40 2 99 97
1 23 5 74 69
2 78 9 93 84
3 11 1 97 96
4 97 24 88 64
... ... ... ... ...
95 54 0 96 96
96 18 0 99 99
97 8 18 97 79
98 75 21 97 76
99 33 14 93 79
You can create a custom function and apply it to a group-by over students:
def week_diff(g):
return g.end_week.max() - g.st_week.min()
df.groupby("student_id").apply(week_diff)
Result:
student_id
1 1
2 2
dtype: int64
As part of my ongoing quest to get my head around pandas I am confronted by a surprise series. I don't understand how and why the output is a series - I was expecting a dataframe. If someone could explain what is happening here it would be much appreciated.
ta, Andrew
Some data:
hash email date subject subject_length
0 65319af6e jbrockmendel@gmail.com 2020-11-28 REF-IntervalIndex._assert_can_do_setop-38112 44
1 0bf58d8a9 simonjayhawkins@gmail.com 2020-11-28 DOC-add-contibutors-to-1.2.0-release-notes-38132 48
2 d16df293c 45562402+rhshadrach@users.noreply.github.com 2020-11-28 TYP-Add-cast-to-ABC-Index-like-types-38043 42
...
Some Code:
def my_function(row):
output = row['email'].value_counts().sort_values(ascending = False).head(3)
return output
top_three = dataframe.groupby(pd.Grouper(key='date', freq='1M')).apply(my_function)
Some Output:
date
2020-01-31  jbrockmendel@gmail.com                               159
            50263213+MomIsBestFriend@users.noreply.github.com     44
            TomAugspurger@users.noreply.github.com                41
...
2020-10-31  jbrockmendel@gmail.com                               170
            2658661+dsaxton@users.noreply.github.com              23
            61934744+phofl@users.noreply.github.com               21
2020-11-30  jbrockmendel@gmail.com                               134
            61934744+phofl@users.noreply.github.com               36
            41443370+ivanovmg@users.noreply.github.com            19
Name: email, dtype: int64
It depends on what your groupby is returning.
In your case, you are applying a function to row['email'] and returning a single value_counts result per group, so the grouping keys and the counted emails all end up in the index. That is, the groupby returns a single column of values under a multi-level index, which pandas hands back as a Series instead of a DataFrame. A reset_index() would therefore give you the DataFrame you expect.
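As a minimal sketch of that fix (my naming; top_three is the Series built in the question):
# top_three has a two-level index: the month bucket from the Grouper and
# the email address (that second level is unnamed, so it comes out as
# 'level_1' unless you rename it). reset_index() moves both levels into
# ordinary columns and returns a DataFrame; name= labels the counts column.
top_three_df = top_three.reset_index(name='count')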
For more clarity on which data structure is returned, we can do a toy experiment.
For example, for the first case, the apply function is applying the lambda function to groups where each group contains a dataframe (check [i for i in df.groupby(['a'])] to see what each group contains).
df = pd.DataFrame({'a':[1,1,2,2,3], 'b':[4,5,6,7,8]})
print(df.groupby(['a']).apply(lambda x:x**2))
#dataframe
a b
0 1 16
1 1 25
2 4 36
3 4 49
4 9 64
For the second case, we are applying the lambda function to a Series object, so only a single series is returned. In this case, the result is a series rather than a dataframe.
print(df.groupby(['a'])['b'].apply(lambda x:x**2))
#series
0 16
1 25
2 36
3 49
4 64
Name: b, dtype: int64
This can be solved simply by selecting with [['b']], which keeps each group as a one-column dataframe:
print(df.groupby(['a'])[['b']].apply(lambda x:x**2))
#dataframe
b
0 16
1 25
2 36
3 49
4 64
I have 3 days of time series data with multiple columns in it, in one single DataFrame that includes all 3 days' data. I want 3 different DataFrames based on the column "Dates", i.e. df["Dates"].
For Example:
Available Dataframe is: df
Expected Output: Based on Three different Dates
First DataFrame: df_23
Second DataFrame: df_24
Third DataFrame: df_25
I want to use these all three DataFrames separately for analysis.
I tried the code below, but I am not able to use those three dataframes (rather, I don't know how to use them). Can anybody help me make my code work better? Thank you.
The above code just prints the DataFrame as three DataFrames, and not even as expected per the code!
Unsure if you're saving your variable into a csv or keeping it in memory for further use;
you could pass each unique value into a dict and access each frame by its key:
print(df)
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
4 54 24
5 10 24
6 77 24
7 95 24
8 58 25
9 53 25
10 44 25
11 94 25
d = {}
for frame, data in df.groupby('Dates'):
d[f'df{frame}'] = data
print(d['df23'])
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
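If the goal is instead to save each day out to its own csv, the same groupby loop works; a small sketch (the filename pattern is my own invention):
for frame, data in df.groupby('Dates'):
    data.to_csv(f'df_{frame}.csv', index=False)  # writes df_23.csv, df_24.csv, df_25.csv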
Edit for the updated request:
for k,v in d.items():
i = (v['Cal'].loc[v['Cal'] > 70].count())
print(f"{v['Dates'].unique()[0]} --> {i} times")
23 --> 4 times
24 --> 2 times
25 --> 1 times
I have this code:
import pandas as pd
data = pd.read_csv("test.csv", sep=",")
The data array looks like this:
The problem is that I can't split it by columns like this:
week = data[:,1]
It should put the second column into week, but it doesn't work; instead it raises:
TypeError: unhashable type: 'slice'
How should I do this to make it work?
I am also wondering what this code does exactly (I don't really understand the np.newaxis part):
week = data['1'][:, np.newaxis]
Result:
There are a few issues here.
First, read_csv uses a comma as a separator by default, so you don't need to specify that.
Second, the pandas csv reader by default uses the first row to get column headings. That doesn't appear to be what you want, so you need to use the header=None argument.
Third, it looks like your first column is the row number. You can use index_col=0 to use that column as the index.
Fourth, for pandas, the first index is the column, not the row. Further, using the standard data[ind] notation is indexing by column name, rather than column number. And you can't use a comma to index both row and column at the same time (you need to use data.loc[row, col] to do that).
So for your case, all you need to do to get the second column is data[2], or if you use the first column as the row number then the second column becomes the first column, so you would do data[1]. This returns a pandas Series, which is the 1D equivalent of a 2D DataFrame.
So the whole thing should look like this:
import pandas as pd
data = pd.read_csv('test.csv', header=None, index_col=0)
week = data[1]
data looks like this:
1 2 3 4
0
1 10 2 100 12
2 15 5 150 15
3 25 7 240 20
4 22 12 350 20
5 51 13 552 20
6 134 20 880 36
7 150 22 900 38
8 200 29 1020 44
9 212 31 1100 46
10 199 23 1089 45
11 220 32 1145 60
The '0' row doesn't exist; it is just there for informational purposes.
week looks like this:
0
1 10
2 15
3 25
4 22
5 51
6 134
7 150
8 200
9 212
10 199
11 220
Name: 1, dtype: int64
However, you can give columns (and rows) meaningful names in pandas, and then access them by those names. I don't know the column names, so I just made some up:
import pandas as pd
data = pd.read_csv('test.csv', header=None, index_col=0, names=['week', 'spam', 'eggs', 'grail'])
week = data['week']
In this case, data looks like this:
week spam eggs grail
1 10 2 100 12
2 15 5 150 15
3 25 7 240 20
4 33 12 350 20
5 51 13 552 20
6 134 20 880 36
7 150 22 900 38
8 200 29 1020 44
9 212 31 1100 46
10 199 23 1089 45
11 220 32 1145 50
And week looks like this:
1 10
2 15
3 25
4 33
5 51
6 134
7 150
8 200
9 212
10 199
11 220
Name: week, dtype: int64
For np.newaxis, what that does is add one dimension to the array. So if you have a 1D array (a vector), using np.newaxis on it turns it into a 2D array. It would turn a 2D array into a 3D array, 3D into 4D, and so on. Depending on where you put it (such as [:,np.newaxis] vs. [np.newaxis,:]), you determine which dimension to add. So np.arange(10)[np.newaxis,:] (or just np.arange(10)[np.newaxis]) gives you a shape (1,10) 2D array, while np.arange(10)[:,np.newaxis] gives you a shape (10,1) 2D array.
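A quick numpy sketch of those shapes:
import numpy as np
a = np.arange(10)
print(a.shape)                 # (10,)
print(a[np.newaxis, :].shape)  # (1, 10)
print(a[:, np.newaxis].shape)  # (10, 1)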
In your case, what the line is doing is getting the column with the name 1, which is a 1D pandas Series, then adding a new dimension to it. However, instead of turning it back into a DataFrame, it converts it into a 1D numpy array, then adds one dimension to make it a 2D numpy array.
This, however, is dangerous long-term. There is no guarantee that this sort of silent conversion won't be changed at some point. To change a pandas object to a numpy one, you should use an explicit conversion with the values method, so in your case data.values or data['1'].values.
However, you don't really need a numpy array. A series is fine. If you really want a 2D object, you can convert a Series into a DataFrame by using something like data['1'].to_frame().