First of all, I'm quite new to programming overall (< 2 months), so I'm sorry if this is a 'simple, no need to ask for help, try it yourself until you get it done' kind of problem.
I have two data frames with partially overlapping content: a general overview of mobile numbers, including their cost centers in the company, and monthly invoices listing the affected mobile numbers and their invoice amounts.
I'd like to compare the 'mobile-numbers' column of the monthly invoices DF to the 'mobile-numbers' column of the general overview DF and, where they match, assign the respective cost center to the mobile number in the monthly invoices DF.
I'd love to share my code with you, but unfortunately I have absolutely zero clue how to solve that problem in any way.
Thanks
Edit: I'm from Germany and tried my best to explain the problem in English. If anything is unclear, just tell me :)
Example of desired result
This program should meet your needs. In the second dataframe I put the value '40' to demonstrate that cells already filled in will not be zeroed out; a replacement only happens when there is a matching value between the dataframes. If you want a better explanation of the program, comment below, and don't forget to vote and mark as solved. I also added some prints for a better view, but in general they are not necessary.
import pandas as pd

general_df = pd.DataFrame({"mobile_number": [1234, 3456, 6545, 4534, 9874],
                           "cost_center": ['23F', '67F', '32W', '42W', '98W']})
invoice_df = pd.DataFrame({"mobile_number": [4534, 5567, 1234, 4871, 1298],
                           "invoice_amount": ['19,99E', '19,99E', '19,99E', '19,99E', '19,99E'],
                           "cost_center": ['', '', '', '', '40']})
print(f"""GENERAL OVERVIEW DF
{general_df}
________________________________________
INVOICE DF
{invoice_df}
_________________________________________
INVOICE RESULT
""")
def func(line):
    # Look up the row in general_df whose mobile_number matches this invoice row.
    match = general_df.loc[general_df['mobile_number'] == line['mobile_number']]
    if match.empty:
        # No match found: keep the cost_center the invoice row already has.
        return line['cost_center']
    else:
        # Match found: take the cost_center from general_df.
        return match['cost_center'].values[0]

invoice_df['cost_center'] = invoice_df.apply(func, axis=1)
print(invoice_df)
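A row-by-row apply works, but for larger frames a vectorized lookup is usually preferable. Here is a minimal sketch of the same assignment using Series.map, assuming mobile_number is unique in general_df; unmatched numbers keep the cost_center the invoice row already had:

# Build a mobile_number -> cost_center lookup from the overview frame.
lookup = general_df.set_index('mobile_number')['cost_center']

# Map each invoice number through the lookup; numbers without a match give
# NaN, which we replace with the invoice row's existing cost_center.
invoice_df['cost_center'] = (invoice_df['mobile_number'].map(lookup)
                             .fillna(invoice_df['cost_center']))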
I'm trying to drop rows whose 'Comments' column contains certain strings. I want to drop all rows whose values contain the strings '[removed]' or '[deleted]'.
My df looks like this:
Comments
1 The main thing is the price appreciation of the token (this determines the gains or losses more
than anything). Followed by the ecosystem for the liquid staking asset, the more opportunities
and protocols that accept the asset as collateral, the better. Finally, the yield for staking
comes into play.
2 [deleted]
3 [removed]
4 I could be totally wrong, but sounds like destroying an asset and claiming a loss, which I
believe is fraudulent. Like someone else said, get a tax guy - for this year anyway and then
you'll know for sure. Peace of mind has value too.
I have tried df[df["Comments"].str.contains("removed")==False]
But when I try to save the dataframe, the rows are still there.
EDIT:
My full code
import pandas as pd
sol2020 = pd.read_csv("Solana_2020_Comments_Time_Adjusted.csv")
sol2021 = pd.read_csv("Solana_2021_Comments_Time_Adjusted.csv")
df = pd.concat([sol2021, sol2020], ignore_index=True, sort=False)
df[df["Comments"].str.contains("deleted")==False]
df[df["Comments"].str.contains("removed")==False]
Try this. I created a dataframe with a Comments column using my own comments, but it should work for you. Note that in your code the filtered result is never assigned back to df, which is why the rows are still there when you save.
import pandas as pd
sample_data = {'Comments': ['first comment whatever', '[deleted]', '[removed]', 'last comments whatever']}
df = pd.DataFrame(sample_data)
data = df[df["Comments"].str.contains("deleted|removed")==False]
print(data)
The output I got:
Comments
0 first comment whatever
3 last comments whatever
You can do it like this:
new_df = df[~(df['Comments'].str.startswith('[') & df['Comments'].str.endswith(']'))].reset_index(drop=True)
Output:
>>> new_df
Comments
0 The main thing is the price appreciation of th...
3 I could be totally wrong, but sounds like dest...
That will remove all rows where the value of the Comments column for that row starts with [ and ends with ].
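If you want to be stricter and drop only comments that consist entirely of one of those placeholder tokens (rather than any comment merely containing the words), a small sketch using isin:

# Keep only the rows whose comment is not exactly '[deleted]' or '[removed]'.
mask = df['Comments'].isin(['[deleted]', '[removed]'])
new_df = df[~mask].reset_index(drop=True)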
This is my first post here; I hope you will understand what troubles me.
So, I have a DataFrame that contains daily prices for some 1,200 companies, beginning in 2010, and I want to calculate the total return for each one. The DataFrame is indexed by date. I could use the
df.iloc[-1]/df.iloc[0] method, but some companies started trading publicly at a later date, so I can't get results for those: their first price is a NaN value. I've tried creating a list containing the first valid index for every stock (column), but when I use it to calculate the total returns, I get the wrong result!
I've tried a classic for loop:
for l in list:
    returns = df.iloc[-1]/df.iloc[l]
For instance, the last price of one stock was around $16 and the first price I have is $1.5, which would be more than a 10x return, yet my result is only about 1.1! I should add that the aforementioned list includes the first valid index for Date as well, in the first position.
Can somebody please help me? Thank you very much
There are many ways you can go about this. But I do recommend you brush up on your Python basics with simple examples before moving on to more complicated ones.
If you want to do it your way, you can do it like this:
returns = {}
for stock_name in df.columns:
    # Drop the leading NaNs so iloc[0] is the first traded price.
    prices = df[stock_name].dropna()
    returns[stock_name] = prices.iloc[-1] / prices.iloc[0]
A more pythonic way is to do it in vectorized form, like this:
returns = ((1 + df.ffill().pct_change())
           .cumprod()
           .iloc[-1])
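A quick check on toy data (hypothetical prices, with one stock that starts trading later) shows that this gives each total return relative to the first valid price:

import pandas as pd
import numpy as np

# Stock B starts trading later, so its first rows are NaN.
df = pd.DataFrame({'A': [10.0, 11.0, 12.0, 16.0],
                   'B': [np.nan, np.nan, 1.5, 3.0]},
                  index=pd.date_range('2010-01-01', periods=4))

returns = (1 + df.ffill().pct_change()).cumprod().iloc[-1]
print(returns)  # A: 1.6 (16/10), B: 2.0 (3/1.5)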
I have an initial dummy dataframe with 7 columns and 1 row, with given column names and initialized to zeros:
import pandas as pd
import numpy as np

d = pd.DataFrame(np.zeros((1, 7)))
d = d.rename(columns={0: "Gender_M",
                      1: "Gender_F",
                      2: "Employed_Self",
                      3: "Employed_Employee",
                      4: "Married_Y",
                      5: "Married_N",
                      6: "Salary"})
Now I have a single record
data = [['M', 'Employee', 'Y',85412]]
data_test = pd.DataFrame(data, columns = ['Gender', 'Employed', 'Married','Salary'])
From the single record I have to create a new dataframe where: if the Gender column has M, then Gender_M should be set to 1 and Gender_F left at zero; if the Employed column has Employee, then Employed_Employee is set to 1 and Employed_Self left at zero; and the same with Married. For the integer column Salary, just set the value 85412. I tried this with if statements, but it turns into a long block of code. Is there a simpler way?
Here is one way, using update twice:
# First update fills the overlapping Salary column with the record's value.
d.update(data_test)
# Rename columns to '<column>_<value>' (e.g. Gender -> Gender_M) and set all cells to 1.
data_test.columns = data_test.columns + '_' + data_test.astype(str).iloc[0]
data_test.iloc[:] = 1
# Second update flips the matching dummy columns (Gender_M, Employed_Employee, Married_Y) to 1.
d.update(data_test)
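An alternative sketch using pd.get_dummies, which builds the '<column>_<value>' names directly and only needs a reindex to fill in the categories missing from the record (assuming the column layout from the question):

import pandas as pd

data = [['M', 'Employee', 'Y', 85412]]
data_test = pd.DataFrame(data, columns=['Gender', 'Employed', 'Married', 'Salary'])

# One-hot encode the categorical columns; Salary passes through untouched.
encoded = pd.get_dummies(data_test, columns=['Gender', 'Employed', 'Married'])

# Align to the full dummy layout, filling the absent categories with 0.
result = encoded.reindex(columns=['Gender_M', 'Gender_F', 'Employed_Self',
                                  'Employed_Employee', 'Married_Y', 'Married_N',
                                  'Salary'], fill_value=0)
print(result)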
Alas, homework is often designed to be boring and repetitive...
You do not have a problem - rather, you want other people to do the work for you. SO is not for this purpose - post a specific problem and you will find many people willing to help.
So show your FULL attempt, then ask "Is there a better way?"
I am trying to find values inside a dataframe that has been grouped.
I grouped payment data by the time the person borrowed the money and the months it took them to pay, and summed the amount they paid. My goal is to find the list of months it took people to pay back.
For example, how can I know the list of 'month_taken' when start_yyyymm is 201807?
payment_sum_monthly = payment_data.groupby(['start_yyyymm', 'month_taken'])[['amount']].sum()
If I use R and put the payment data in data.table form, I can find out the list of month_taken by
payment_sum_monthly[start_yyyymm == '201807',month_taken]
How can I do this in Python? Thanks.
is_date = payment_data['start_yyyymm'] == "201807"
It should give you a boolean mask of all the rows where 'start_yyyymm' is 201807. Then, to select those rows, you can do the following:
date_set = payment_data[is_date].copy()
payment_sum_monthly = date_set.groupby('month_taken').aggregate(sum)
payment_sum_monthly
And if you need one more condition, you can do the following:
condition2 = payment_data['column name'] == condition
payment_data[is_date & condition2]
I hope I got your question right and that it helps.
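If you keep the original two-level groupby instead, you can get the equivalent of the R one-liner straight from the result's MultiIndex. A sketch, assuming start_yyyymm is stored as a string:

# The grouped result has a (start_yyyymm, month_taken) MultiIndex.
payment_sum_monthly = payment_data.groupby(['start_yyyymm', 'month_taken'])[['amount']].sum()

# Take the 201807 cross-section; the remaining index level is month_taken.
months = payment_sum_monthly.xs('201807', level='start_yyyymm').index.tolist()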
I'd like to find a simple way to increment values in one column that correspond to a particular date in pandas. This is what I have so far:
old_casual_value = tstDF['casual'].loc[tstDF['ds'] == '2012-10-08'].values[0]
old_registered_value = tstDF['registered'].loc[tstDF['ds'] == '2012-10-08'].values[0]
# Adjusting the numbers of customers for that day.
tstDF.set_value(406, 'casual', old_casual_value*1.05)
tstDF.set_value(406, 'registered', old_registered_value*1.05)
If I could find a better and simpler way to do this (a one-liner), I'd greatly appreciate it.
Thanks for your help.
The following one-liner should work based on your limited description of the problem. If not, please provide more information.
# Filter the records to the specified date, then multiply the casual and registered values by 1.05.
tstDF.loc[tstDF['ds'] == '2012-10-08',['casual','registered']]*=1.05
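As a side note, set_value has been deprecated for a long time; if you ever do need a single-cell assignment by label, .at is the idiomatic replacement (using the row label 406 from your snippet):

# .at is the scalar, label-based accessor that replaced set_value.
tstDF.at[406, 'casual'] = old_casual_value * 1.05
tstDF.at[406, 'registered'] = old_registered_value * 1.05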