I'm working with two CSV files.
In the performance file I have historical data on loan performance (e.g. loan 110's performance from month 1 to 7, then loan 111's performance from month 1 to 20). The columns are: A = loan ID, B = performance month, C = default amount. For each loan ID there is one row per month of performance.
I'm trying to create a loop that gives me the first month in which each loan has a default, and copy that month and default amount into my second CSV file, which has descriptive data on each loan ID. The idea is to add two columns to the second file and, for each loan ID, retrieve the month when it first has a default value.
I'm working in a Jupyter notebook and so far I've imported the pandas library and read the performance CSV file.
Any guidance would be appreciated.
import pandas as pd
data = pd.read_csv(r'c:\users\guest1\documents\python_example_performance.csv',delimiter=',')
data.head()
First of all, I can't comment as I don't have enough reputation, so I would need more clarification on the issue. Could you show what the data looks like? It's a bit confusing for me between loans 110 and 111 and the months 1-7 or 1-20.
Based on my current understanding, I would first remove the non-default rows from the first CSV.
Since you're using pandas, you can do this with boolean indexing (or .loc).
The syntax generally looks like this.
df = df[df[cols] > 0]
If there are duplicates, keep the earliest or the latest month, depending on your goal (for the first default month you would sort by month and keep the first record). Pandas supports drop_duplicates with the option of keeping the first or last record. The syntax generally looks like this.
df = df.drop_duplicates(subset="Col1", keep='last')
For more documentation, please refer to: Pandas - Drop Duplicates
Lastly, you need to perform a join of both DataFrames based on loan ID. The syntax generally looks like this.
df = pd.merge(df1, df2, how='left', on=['LoanID'])
For more documentation, please refer to: Pandas - Merge
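Putting those three steps together, a minimal sketch could look like the following. The column names LoanID, Month and DefaultAmount, the descriptive file's path, and the output filename are assumptions; rename them to match your actual files. Here keep='first' after sorting by month gives the first default month.
import pandas as pd

# Assumed column names (LoanID, Month, DefaultAmount); adjust to your headers.
perf = pd.read_csv(r'c:\users\guest1\documents\python_example_performance.csv')
desc = pd.read_csv(r'c:\users\guest1\documents\python_example_descriptive.csv')  # hypothetical path

defaults = perf[perf['DefaultAmount'] > 0]            # keep only months with a default
defaults = defaults.sort_values(['LoanID', 'Month'])  # earliest month first
first_default = defaults.drop_duplicates(subset='LoanID', keep='first')

merged = pd.merge(desc, first_default[['LoanID', 'Month', 'DefaultAmount']],
                  how='left', on='LoanID')
merged.to_csv('descriptive_with_first_default.csv', index=False)  # hypothetical output name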
I have filtered a pandas data frame by grouping and taking the sum; now I want all the details and no longer need the sum.
For example, what I have looks like the image below.
What I want is for each of the individual transactions to be shown. Currently the Amount column is the sum of all transactions done by an individual on a specific date, and I want to see all the individual amounts. Is this possible?
I don't know how to filter the larger df by the groupby one. I have also tried using isin() with multiple &s, but it does not work: for example, "David" could be in my groupby df on Sept 15, but in the larger df he has made transactions on other days as well, and those slip through when using isin().
Hello there and welcome,
first of all, as I've learned myself, always try:
to give some data (in text or code form) as your input
to share your expected output, to avoid more questions
to have fun :-)
I'm new as well, and I did my best to cover as many possibilities as I could; at least people can use my code to reproduce your df.
import pandas as pd

# From the picture
data={'Date': ['2014-06-30','2014-07-02','2014-07-02','2014-07-03','2014-07-09','2014-07-14','2014-07-17','2014-07-25','2014-07-29','2014-07-29','2014-08-06','2014-08-11','2014-08-22'],
'LastName':['Cow','Kind','Lion','Steel','Torn','White','Goth','Hin','Hin','Torn','Goth','Hin','Hin'],
'FirstName':['C','J','K','J','M','D','M','G','G','M','M','G','G'],
'Vendor':['Jail','Vet','TGI','Dept','Show','Still','Turf','Glass','Sup','Ref','Turf','Lock','Brenn'],
'Amount': [5015.70,6293.27,7043.00,7600,9887.08,5131.74,5037.55,5273.55,9455.48,5003.71,6675,7670.5,8698.18]
}
df=pd.DataFrame(data)
incoming=df.groupby(['Date','LastName','FirstName','Vendor','Amount']).count()
#what I believe you did to get Date grouped
incoming
Now here is my answer:
First, I merged FirstName and LastName:
df['CompleteName'] = df[['FirstName','LastName']].agg('.'.join, axis=1)  # combine first and last names
Then I did some statistics for the amount, for different groups:
# creating new columns with statistics per group (CompleteName, Date, Vendor)
df['AmountSumName']=df['Amount'].groupby(df['CompleteName']).transform('sum')
df['AmountSumDate']=df['Amount'].groupby(df['Date']).transform('sum')
df['AmountSumVendor']=df['Amount'].groupby(df['Vendor']).transform('sum')
df
Now just group by whatever you wish.
Hope I could answer your question.
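If the goal is to pull back every underlying transaction for the grouped rows rather than their sums, one option (a sketch, not part of the answer above; the 9000 threshold is just a placeholder) is to merge the groupby summary back onto the full DataFrame on its key columns, which sidesteps the multi-column isin() problem:
summary = df.groupby(['Date', 'CompleteName'], as_index=False)['Amount'].sum()
wanted = summary[summary['Amount'] > 9000]  # placeholder filter on the grouped sums
# inner merge keeps only rows whose (Date, CompleteName) pair appears in the filtered summary
detail = df.merge(wanted[['Date', 'CompleteName']], on=['Date', 'CompleteName'], how='inner')
detail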
I have a list of company names, dates, and PE ratios.
I need to find the average of the previous 10 years of data as of a given date, such that only month-end dates are considered.
For example, if I need to find the average as of 31 Dec 2015, I first need to find the data for all month ends from 31/12/2005 to 31/12/2015, and then take their average.
Sample data I have:
Required output:
Here is what I have done so far:
df = pd.read_csv('daily_valuation_ratios_cc.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
columns = ['pe', 'price_bv', 'mcap_ns', 'ev_ebidta']
df_mean = df.groupby('Company Name')[columns].resample('M').mean()
But this method computes the mean of the daily values within each month and shows the result monthly, unlike my sample output.
I am new to pandas, please help.
Edit:
df3 = df.groupby(['Company Name','year','month'])
df3.first()
This code works. Now I just have one problem: exporting the dataframe with to_csv. Please help.
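For the export, a minimal sketch, assuming the grouped result from the edit above and a hypothetical output filename:
monthly = df3.first().reset_index()  # flatten the group keys back into columns
monthly.to_csv('monthly_valuation_ratios.csv', index=False)  # hypothetical filename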
A DataFrame has a method called groupby that groups rows by a column, and the groups can then be aggregated.
So if you were to run data.groupby('pe'), you would get the data grouped by that column.
Now if you were to tack on .describe(), you would get the standard deviation/mean/min/etc.
Example:
data.groupby('pe').describe()
Edit: You can also use built-in aggregate functions such as .max()/.mean()/etc. with groupby().
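For example, applied to the columns from the question's code (so 'Company Name' and 'pe' are taken from there), that could look like:
data.groupby('Company Name')['pe'].mean()                # mean PE per company
data.groupby('Company Name')['pe'].agg(['mean', 'max'])  # several aggregates at once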
Sorry about my question, but I tried some solutions and I couldn't get the right answer. I'm working with the Airbnb Boston database and I would like to group by listing_id in the calendar database and then get the rows with the minimum price, where the price is different from 0.0.
The database has 1,308,890 rows and 4 columns. There are 3,585 unique listing_id values.
dfc_calendar[(dfc_calendar['available'] == True)].groupby('listing_id')['price'].min()
Using the isin command to compare listing_id takes a long time and eventually stops with an error. When I try to get the indexes after the groupby, I get listing_id values, but I need the indexes of the rows. How can I do it?
Thank you!
Not sure I got you. Shout if I got it wrong, because I am not clear what "different from 0.0" means.
Data
import pandas as pd
df=pd.DataFrame({'listing_id':['12345','12349','12345','12349','12345'], 'Price':[3,5,67,7,12]})
df['date'] = pd.date_range(start='1/2/2020', periods=len(df), freq='D')
df
You can go
df.groupby('listing_id')['Price'].min()
Or
df['MinPrice']=df.groupby('listing_id')['Price'].transform('min')
df
If you wanted to add availability to the grouping, please try
df['MinPrice'] = df.groupby(['listing_id', 'available'])['Price'].transform('min')
df
Or
df.loc[df.groupby('listing_id')['Price'].idxmin()]
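To also honour the "price different from 0.0" condition, one possibility (a sketch, not shown in the answer above) is to drop the zero prices before taking idxmin; the result of idxmin is exactly the set of row indexes you can feed to .loc:
nonzero = df[df['Price'] != 0]                      # discard zero prices first
rows = nonzero.loc[nonzero.groupby('listing_id')['Price'].idxmin()]
rows.index  # the original row indexes of the minimum-price rows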
I have a very simple dataframe with columns: Index, Person, Item, Date. There are only 4 people, 3 items, and random dates. All person/item/date combinations are unique. I am trying to get a simple pivot-table-like df to print using:
import pandas as pd
mydf = pd.read_csv("Test_Data.csv",index_col=[0])
mydf = mydf.sort_values(by=['Date','Item','Person'], ascending=False)
print(mydf.groupby(['Person','Item'])['Date'].max())
However, I noticed that while the structure is what I want, the data is not. It is not returning the max date for the Person/Item combination. I thought sorting things first would help, but it did not. Do I need to create a temp df first and then join to do what I'm trying to do?
Also to be clear, there are 28 rows of data (all test data) with some People/Items being repeated but with different dates. Index is just 0 through 27.
Figured it out! Should have made sure the Date field was actually recognized as a date:
mydf['Date'] = pd.to_datetime(mydf['Date'])
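With that conversion in place, the original groupby returns the true latest date per Person/Item; the corrected sequence, as a short sketch:
import pandas as pd

mydf = pd.read_csv("Test_Data.csv", index_col=[0])
mydf['Date'] = pd.to_datetime(mydf['Date'])  # ensure Date is a real datetime, not a string
print(mydf.groupby(['Person', 'Item'])['Date'].max())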
I'm working with a crypto-currency data sample; each cell contains a dictionary. The dictionary contains the open price, close price, highest price, lowest price, volume, and market cap. The columns are the corresponding dates and the index is the name of each cryptocurrency.
I don't know how to prepare the data in order to find the correlation between different currencies, and between the highest price and volume, for example. How can this be done in Python (pandas)? Also, how would I define a date range in such a situation?
Here's a link to the data sample, my coding and a printout of the data (Access is OPEN TO PUBLIC): https://drive.google.com/open?id=1mjgq0lEf46OmF4zK8sboXylleNs0zx7I
To begin with, I would suggest rearranging your data so that each currency's OHLCV values are their own columns (e.g. "btc_open | btc_high" etc.). This makes generating correlation matrices far easier. I'd also suggest beginning with only one metric (e.g. close price) and perhaps period movement (e.g. close-open) in your analysis. To answer your question:
Pandas can return a correlation matrix of all columns with:
df.corr()
If you want to use only specific columns, select those from the DataFrame:
df[["col1", "col2"]].corr()
You can return a single correlation value between two columns with the form:
df["col1"].corr(df["col2"])
If you'd like to specify a specific date range, I'd refer you to this question. I believe this will require your date column or index to be of the type datetime. If you don't know how to work with or convert to this type, I would suggest consulting the pandas documentation (perhaps begin with pandas.to_datetime).
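As a sketch of the date-range part, assuming the dates end up in the index (as they do after the transpose described in the edit below); the range shown is just a placeholder:
df.index = pd.to_datetime(df.index)          # make the index a DatetimeIndex
subset = df.loc["2017-01-01":"2017-06-30"]   # placeholder date range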
In future, I would suggest including a data snippet in your post. I don't believe Google Drive is an appropriate form to share data, and it definitely is not appropriate to set the data to "request access".
EDIT: I checked your data and created a smaller subset to test this method on. If there are imperfections in the data you may find problems, but I had none when I tested it on a sample of your first 100 days and 10 coins (after transposing, df.iloc[:100, :10]).
Firstly, transpose the DataFrame so columns are organised by coin and rows are dates.
df = df.T
Following this, we concatenate to a new DataFrame (result). Alternatively, concatenate to the original and drop columns afterwards. Unfortunately I can't think of a non-iterative method. This method goes column by column, creates a DataFrame for each coin, adds the coin name prefix to the column names, then concatenates each DataFrame to the end.
result = pd.DataFrame()
coins = df.columns.tolist()
for coin in coins:
    coin_data = df[coin]
    split_coin = coin_data.apply(pd.Series).add_prefix(coin + "_")
    result = pd.concat([result, split_coin], axis=1)
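After the loop, result has one column per coin and metric, named like <coin>_open, <coin>_close, and so on. The two column names below are placeholders; check result.columns for the actual prefixes your data produces:
result = result.astype(float)                       # values arrive as objects from the dicts
result[["bitcoin_close", "ethereum_close"]].corr()  # placeholder column names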