Group pandas dataframe by an outside column - python

import pandas as pd
db = pd.read_csv('17base.csv')
order = pd.read_csv('order.csv')
db = db.groupby(db['code'])['r'].mean()
The two dataframes are just like the tables I drew here, but this code doesn't work, since I want to group every 50 "codes".
I have a dataframe like this. I want to first order the codes by the order given in another dataframe, which has only one column. Second, I want to construct portfolios of every 50 codes, and "r" is basically the average over the same date (the date is in "year-week" format). The result should list the portfolio number, each date and "r". It seems a really complicated task for me; can anyone help me?
Here is a sample original table.
code    Date
1       2017-1
1       2017-2
1       2017-3
2       2017-1
2       2017-2
2       2017-3
...     ...
3000    2017-3
And I have another table which gives the order the codes should be in, like
code
6
7
564
...
That's the best I could think of to explain myself...
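Not a definitive answer, but here is a minimal sketch of one way to do it, assuming 17base.csv has columns code, Date and r, and order.csv has a single code column listing the desired order:

import pandas as pd

db = pd.read_csv('17base.csv')      # assumed columns: code, Date, r
order = pd.read_csv('order.csv')    # assumed single column: code, in the desired order

# position of every code in the ordering table
rank = {code: i for i, code in enumerate(order['code'])}

# bucket every 50 consecutive codes (in that order) into one portfolio
db['portfolio'] = db['code'].map(rank) // 50

# "r" is the average over the same date within each portfolio
result = db.groupby(['portfolio', 'Date'])['r'].mean().reset_index()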

Related

Python Pandas: Create a dictionary from a dataframe with values equal to a list of all row values

I am trying to create a dictionary from a dataframe in the following way.
My dataframe contains a column called station_id. The station_id values are unique, that is, each row corresponds to one station id. Then there is another column called trip_id (see example below). Many stations can be associated with a single trip_id. For example:
import pandas as pd

l1 = [1, 1, 2, 2]
l2 = [34, 45, 66, 67]
df1 = pd.DataFrame(list(zip(l1, l2)), columns=['trip_id', 'station_name'])
df1.head()
trip_id station_name
0 1 34
1 1 45
2 2 66
3 2 67
I am trying to get a dictionary d={1:[34,45],2:[66,67]}.
I solved it with a for loop in the following fashion.
from tqdm import tqdm

Trips_Stations = {}
Trips = set(df1['trip_id'])
T = list(Trips)
for i in tqdm(range(len(T))):
    c_id = T[i]
    Values = list(df1[df1.trip_id == c_id].station_name)
    Trips_Stations.update({c_id: Values})
Trips_Stations
My actual dataset has about 65000 rows. The above takes about 2 minutes to run. While this is acceptable for my application, I was wondering if there is a faster way to do it using base pandas.
Thanks
Somehow Stack Overflow suggested that I look at groupby. This is much faster:
from collections import OrderedDict

d = df1.groupby('trip_id')['station_name'].apply(list)
o_d = d.to_dict(OrderedDict)
o_d = dict(o_d)
It took about 30 secs for the dataframe with 65000 rows.
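Side note: on Python 3.7+ a plain dict already preserves insertion order, so the OrderedDict detour can probably be skipped:
o_d = d.to_dict()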

Is there a way to create a Pandas dataframe where the values map to an index/row pair?

I was struggling with how to word the question, so I will provide an example of what I am trying to do below. I have a dataframe that looks like this:
ID CODE COST
0 60086 V2401 105.38
1 60142 V2500 221.58
2 60086 V2500 105.38
3 60134 V2750 35
4 60134 V2020 0
I am trying to create a dataframe that has the ID as rows, the CODE as columns, and the COST as values, since the cost for the same code is different per ID. How can I do this in pandas?
This seems like a classic "long to wide" problem, and there are several ways to do it. You can try pivot_table, for example:
df.pivot_table(index='ID', columns='CODE', values='COST')
(assuming that the dataframe is df.)
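A quick runnable check of this, as a sketch, using the sample data from the question (column names as shown above):

import pandas as pd

df = pd.DataFrame({
    'ID':   [60086, 60142, 60086, 60134, 60134],
    'CODE': ['V2401', 'V2500', 'V2500', 'V2750', 'V2020'],
    'COST': [105.38, 221.58, 105.38, 35, 0],
})

# IDs become rows, codes become columns, costs become the values
print(df.pivot_table(index='ID', columns='CODE', values='COST'))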

Calculating each specific occurrence using value_counts() in Python

I have a dataframe named Tasks, containing a column named UserName. I want to count every occurrence of rows containing the same UserName, thereby getting to know how many tasks a user has been assigned. For a better understanding, here's what my dataframe looks like:
In order to achieve this, I used the code below:
Most_Involved = Tasks['UserName'].value_counts()
But this got me a Series like this:
Index Username
John 4
Paul 1
Radu 1
Which is not exactly what I am looking for. How should I re-write the code in order to achieve this:
Most_Involved
Index UserName Tasks
0 John 4
1 Paul 1
2 Radu 1
You can use transform to add a new column to the existing data frame:
df['Tasks'] = df.groupby('UserName')['UserName'].transform('size')
# finally keep one row per user, matching the desired output
df = df[['UserName', 'Tasks']].drop_duplicates().reset_index(drop=True)
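Alternatively, a sketch that stays closer to the original value_counts attempt (assuming the dataframe is named Tasks as in the question):

Most_Involved = (
    Tasks['UserName']
    .value_counts()            # counts per user, users in the index
    .rename_axis('UserName')   # name the index so it becomes a proper column
    .reset_index(name='Tasks') # turn the counts into a 'Tasks' column
)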
You can also find the duplicate rows themselves, based on columns, by using pandas:
duplicateRowsDF = dataframe[dataframe.duplicated(['columnName'])]

How do I extract variables that repeat from an Excel Column using Python?

I'm a beginner at Python and I have a school project where I need to analyze an Excel document. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains a code that we use to identify some materials. A material code looks like this -> 3A8356. There are different material codes in the same column and they repeat a lot. I want to identify them and make a list with only one of each code, no repeats. Is there a way I can analyze the column and extract the repeating codes so I can make a new column with only one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it to something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials']). (The default keep='first' retains one row per code; keep=False would drop every code that repeats, which is not what you want here.)
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
The subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Per the docs, the new data frame with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to re-index the first column, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
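As an alternative sketch, unique() gets the same deduplicated list, and sorting reproduces the order shown in the example output (this assumes the column is really named Materials):

import pandas as pd

df = pd.read_excel(path_to_file)  # path_to_file as above

# one of each code, sorted as in the example output
materials = sorted(df['Materials'].dropna().unique())
unique_df = pd.DataFrame({'Materials': materials})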

Calculate Pandas dataframe column with replace function

I'm working on calculating a field in a Pandas dataframe. Learning Python, I'm trying to find the best method.
The dataframe is quite big, over 55 million rows. It has a few columns, among which date and failure are of interest to me. So the dataframe looks like this:
date failure
2018-09-09 0
2016-05-12 1
2013-12-12 1
2018-05-12 1
2018-05-12 1
I want to calculate failure_date (if failure = 1 then failure_date = date).
I tried something like this:
import pandas as pd
abc = pd.read_pickle('data_abc.pkl')
abc['failure_date'] = abc['failure'].replace(1, abc['date'])
The session has been busy for a very long time (1.5 h) with no result so far. Is this the right approach? Is there a more effective way of calculating a column based on a condition on other columns?
This code adds a column "failure_date" and sets it to the failure date for the failures. It does not address "non-failures".
abc.loc[abc['failure']==1, 'failure_date'] = abc['date']
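An equivalent vectorized sketch using Series.where, which keeps the date where failure is 1 and leaves the other rows empty (NaN/NaT):

abc['failure_date'] = abc['date'].where(abc['failure'] == 1)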
If you don't mind discarding the rest of the dataframe, you could get all the dates where failure is 1 like this:
abc = abc[abc['failure'] == 1]
