Incrementing Values in a Column Based on Values in Another (Pandas)

I'd like to find a simple way to increment values in one column that correspond to a particular date in Pandas. This is what I have so far:
old_casual_value = tstDF['casual'].loc[tstDF['ds'] == '2012-10-08'].values[0]
old_registered_value = tstDF['registered'].loc[tstDF['ds'] == '2012-10-08'].values[0]
# Adjusting the numbers of customers for that day.
tstDF.set_value(406, 'casual', old_casual_value*1.05)
tstDF.set_value(406, 'registered', old_registered_value*1.05)
If I could find a better and simpler way to do this (a one-liner), I'd greatly appreciate it.
Thanks for your help.

The following one-liner should work based on your limited description of your problem. If not, please provide more information.
# The code below first filters the records by the specified date and then multiplies the casual and registered column values by 1.05.
tstDF.loc[tstDF['ds'] == '2012-10-08', ['casual','registered']] *= 1.05
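As a side note, the set_value calls in your original snippet use an API that was deprecated in pandas 0.21 and removed in 1.0. If you ever need the scalar, row-by-row form again, at is the supported replacement; a minimal sketch, reusing the row label 406 and the variables from your snippet:
# .at is the modern label-based scalar setter that replaced set_value
tstDF.at[406, 'casual'] = old_casual_value * 1.05
tstDF.at[406, 'registered'] = old_registered_value * 1.05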


Compare columns (per row) of two DataFrames in Python

First of all, I'm quite new to programming overall (< 2 months), so I'm sorry if this is a 'simple, no need to ask for help, try it yourself until you get it done' kind of problem.
I have two dataframes with partially the same content (a general overview of mobile numbers, including their cost centers in the company, and monthly invoices with the affected mobile numbers and their invoice amounts).
I'd like to compare the content of the 'mobile-numbers' column of the monthly invoices DF to the content of the 'mobile-numbers' column of the general overview DF and, where they match, assign the respective cost center to the mobile number in the monthly invoices DF.
I'd love to share my code with you, but unfortunately I have absolutely zero clue how to solve this problem in any way.
Thanks
Edit: I'm from Germany; I tried my best to explain the problem in English. If there is anything I messed up (so you don't get it), just tell me :)
Example of desired result
This program should meet your needs. In the second dataframe I put the value '40' to demonstrate that values that are already filled in will not be zeroed; the replacement only occurs when there is a matching value between the dataframes. If you want a better explanation of the program, comment below, and don't forget to vote and mark as solved. I also put in some prints for a better view, but in general they are not necessary.
import pandas as pd

general_df = pd.DataFrame({"mobile_number": [1234, 3456, 6545, 4534, 9874],
                           "cost_center": ['23F', '67F', '32W', '42W', '98W']})
invoice_df = pd.DataFrame({"mobile_number": [4534, 5567, 1234, 4871, 1298],
                           "invoice_amount": ['19,99E', '19,99E', '19,99E', '19,99E', '19,99E'],
                           "cost_center": ['', '', '', '', '40']})

print(f"""GENERAL OVERVIEW DF
{general_df}
________________________________________
INVOICE DF
{invoice_df}
_________________________________________
INVOICE RESULT
""")
def func(line):
    # Look up this invoice's mobile number in the general overview.
    match = general_df.loc[general_df['mobile_number'] == line['mobile_number']]
    if match.empty:
        # No match: keep whatever cost center the invoice row already has.
        return line['cost_center']
    # Match found: take the cost center from the general overview.
    return match['cost_center'].values[0]

invoice_df['cost_center'] = invoice_df.apply(func, axis=1)
print(invoice_df)
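For larger frames, a vectorized lookup avoids apply entirely. A sketch of that variant, assuming the mobile numbers in general_df are unique:
# Build a mobile_number -> cost_center lookup, then map it onto the invoices.
# Numbers with no match come back as NaN, so fillna restores the old values.
lookup = general_df.set_index('mobile_number')['cost_center']
invoice_df['cost_center'] = (invoice_df['mobile_number'].map(lookup)
                             .fillna(invoice_df['cost_center']))
print(invoice_df)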

Printing and counting unique values from an .xlsx file

I'm fairly new to Python and still learning the ropes, so I need help with a step-by-step program without using any functions. I understand how to count through an unknown column range and output the quantity. However, for this program, I'm trying to loop through a column, picking out unique numbers and counting their frequency.
So I have an excel file with random numbers down column A. I only put in 20 numbers but let's pretend the range is unknown. How would I go about extracting the unique numbers and inputting them into a separate column along with how many times they appeared in the list?
I'm not really sure how to go about this. :/
unique = 1
while xw.Range((unique, 1)).value != None:
    frequency = 0
    if unique != unique: break
    quantity += 1
"end"
I presume, as you can't use functions, this may be homework... so, high level:
You could first go through the column and put all the values in a list.
Secondly, take the first value from the list and go through the rest of the list: is it in there? If so, then it is not unique. Now remove the value where you found the duplicate from the list. Keep going, and if you find another, remove that too.
Take the second value, and so on.
You would just need list comprehensions, some loops and perhaps .pop(); see the sketch below.
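A minimal sketch of that approach, assuming the column has already been read into a plain list called values (the numbers here are made up):
values = [3, 7, 3, 1, 7, 7]   # stand-in for the numbers read from column A

uniques = []
counts = []
while values:
    current = values.pop(0)   # take the first remaining value
    count = 1
    rest = []
    for v in values:          # scan the rest of the list for duplicates
        if v == current:
            count += 1        # duplicate found: count it and drop it
        else:
            rest.append(v)    # keep everything else for the next round
    values = rest
    uniques.append(current)
    counts.append(count)

print(uniques)   # [3, 7, 1]
print(counts)    # [2, 3, 1]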
Using the pandas library would be the easiest way to do this. I created a sample excel sheet having only one column called "Random_num":
import pandas
data = pandas.read_excel("sample.xlsx", sheet_name = "Sheet1")
print(data.head()) # This would give you a sneak peek of your data
print(data['Random_num'].value_counts()) # This would solve the problem you asked for
# Make sure to pass your column name within the quotation marks
# e.g.: data['your_column'].value_counts()
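If you also want the result written into a separate column rather than just printed, value_counts can be turned back into a two-column frame and saved; a sketch, where the output file name is just an example:
# Convert the counts Series into a frame with explicit column names.
counts = (data['Random_num'].value_counts()
          .rename_axis('unique_value')
          .reset_index(name='frequency'))
counts.to_excel("sample_counts.xlsx", index=False)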
Thanks

Calculating total returns from a DataFrame

This is my first post here; I hope you will understand what troubles me.
So, I have a DataFrame that contains prices for some 1200 companies for each day, beginning in 2010. Now I want to calculate the total return for each one. My DataFrame is indexed by date. I could use the df.iloc[-1]/df.iloc[0] method, but some companies started trading publicly at a later date, so I can't get results for those companies, as they are divided by a NaN value. I've tried creating a list which contains the first valid index for every stock (column), but when I try to calculate the total returns, I get the wrong result!
I've tried a classic for loop:
for l in list:
    returns = df.iloc[-1]/df.iloc[l]
For instance, the last price of one stock was around $16, and the first data I have is $1.5, which would be over a 10x return, yet my result is only about 1.1! I would also like to add that the aforementioned list includes the first valid index for Date as well, in the first position.
Can somebody please help me? Thank you very much
There are many ways you can go about this, actually. But I do recommend you brush up on your Python skills with basic examples before you get into more complicated ones. Note what your loop actually does: df.iloc[l] selects an entire row, so every stock gets divided by a price taken from the same row position rather than by its own first valid price, and each pass of the loop overwrites returns, so you only ever keep the result for the final l in the list. That is why the numbers come out wrong.
If you want to do it your way, you can do it like this:
returns = {}
for stock_name in df.columns:
    returns[stock_name] = df[stock_name].dropna().iloc[-1] / df[stock_name].dropna().iloc[0]
A more pythonic way would be to do it in vectorized form, like this:
returns = ((1 + df.ffill().pct_change())
           .cumprod()
           .iloc[-1])
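A quick sanity check of that vectorized version on a toy frame, where the NaNs mark the period before a stock was listed (made-up prices):
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [10.0, 11.0, 12.0, 16.0],
                   "B": [np.nan, np.nan, 1.5, 3.0]})
# Leading NaNs survive ffill and drop out of the cumulative product,
# so each stock is measured from its own first valid price.
returns = (1 + df.ffill().pct_change()).cumprod().iloc[-1]
print(returns)   # A: 1.6 (16/10), B: 2.0 (3.0/1.5)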

Matching strings in a pandas dataframe using fuzzywuzzy

I have two dataframes: Instructor_Info and Operator_Info
Instructor_Info contains columns called Names and OperatorName, and Operator_Info also has a column called Names. Every name in Instructor_Info has an associated name in Operator_Info. I want to use fuzz.token_sort_ratio() to find these matches by comparing each name in Instructor_Info to every name in Operator_Info and storing the string with the highest score in the OperatorName column.
This is what I have so far:
for index, row in Instructor_Info.iterrows():
    match = 0
    for index1, row1 in Operator_Info.iterrows():
        if fuzz.token_sort_ratio(row['Names'], row1['Names']) > match:
            row['OperatorName'] = row1['Names']
This code runs extremely slowly and gets a couple of false matches (I can fix those manually, so speed is the main issue). If anyone has any faster ideas, it would be much appreciated. Thanks in advance.
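A sketch of one faster direction, assuming the process helper that ships with fuzzywuzzy is available: process.extractOne scores a query against a whole list of choices in one call, which removes the inner iterrows loop (and sidesteps the fact that assigning to row inside iterrows does not write back to the frame):
from fuzzywuzzy import fuzz, process

choices = Operator_Info['Names'].tolist()

def best_operator(name):
    # extractOne returns the highest-scoring (choice, score) pair
    match, score = process.extractOne(name, choices, scorer=fuzz.token_sort_ratio)
    return match

Instructor_Info['OperatorName'] = Instructor_Info['Names'].apply(best_operator)
Installing python-Levenshtein alongside fuzzywuzzy also speeds up the ratio computations considerably.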

Python pandas: efficient loop through two dataframes with different lengths

I have two dataframes with different lengths (df, df1). They share one similar label, "collo_number". I want to search the second dataframe for every collo_number in the first dataframe. The problem is that the second dataframe contains multiple rows, for different dates, for every collo_number. So I want to sum these and add the result in a new column in the first dataframe.
I currently use a loop, but it is rather slow and has to perform this operation for all 7 days in a week. Is there a way to get better performance? I tried multiple solutions but keep getting the error that I cannot use the equals sign for two dataframes with different lengths. Help would really be appreciated! Here is an example of what is working, but with rather bad performance:
df5 = [df1.loc[(df1.index == nasa) & (df1.afleverdag == x1) & (df1.ind_init_actie == "N"), "aantal_colli"].sum()
       for nasa in df.collonr]
Your description is a bit vague (hence my comment). First, what you could do is select the rows of the dataframe that you want to search:
dftmp = df1[(df1.afleverdag==x1) & (df1.ind_init_actie=='N')]
so that you don't do this for every item in the loop.
Second, use .groupby.
newseries = dftmp['aantal_colli'].groupby(dftmp.index).sum()
# .ix has been removed from pandas; reindex aligns the sums to the keys in df
newseries = newseries.reindex(df.collonr.unique())
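To get those sums back into the first frame as a new column, reindexing by the key column keeps the row order aligned; a sketch, using the collonr naming from the question's code (the new column name is just an example):
# One sum per row of df, in df's row order; keys missing from df1 become NaN.
df['sum_aantal_colli'] = newseries.reindex(df.collonr).values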
