I'm very new to using Python for my work.
Here's an Excel worksheet I have:
A is the product name in July,
B is the product Qty in July,
C is the product name in Aug,
D is the product Qty in Aug.
I need to get the difference between them:
find the exact Qty sold in the next month
and calculate the subtraction.
| A            | B | C            | D |
| SRHAVWRQT    | 1 | SRHAVWRQT    | 4 |
| SCMARAED3MQT | 0 | SCMARAED3MQT | 2 |
| WVVMUMCRQT   | 7 | WVVMUMCBR    | 7 |
...
...
I know how to solve this in Excel,
with INDEX + MATCH and the difference:
=G3-INDEX(B:B,MATCH(F3,A:A,0))
Then I would have the result I need.
[screenshots: the original data and the expected result]
But how would I perform this in Python,
and which tool should I use?
(e.g. pandas? numpy?)
The other answers I've read seem to cover only the INDEX/MATCH part,
and/or they solve calculations across multiple sheets,
but I just need the result from two columns.
How to perform an Excel INDEX MATCH equivalent in Python
Index match with python
Calculate Match percentage of values from One sheet to another Using Python
https://focaalvarez.medium.com/vlookup-and-index-match-equivalences-in-pandas-160ac2910399
Or will there just be a completely different way of processing this in Python?
A classic use case there. For anything involving Excel and Python, you'll want to familiarise yourself with the Pandas library; it can handle a lot of what you're asking for.
Now, to how to solve this problem in particular. I'm going to assume the data in the relevant worksheet is laid out as you showed it above: no column headings, with the data from row 1 down in columns A, B, C and D. You can use the code below to load it into Python. It is read without column or row names, so the DataFrame starts at [0, 0] rather than "A1", since rows and columns in pandas are 0-indexed.
import pandas as pd
excelData = pd.read_excel("<DOCUMENT NAME>", sheet_name="<SHEET NAME>", header=None)
After you have loaded the data, you then need to match the month 1 data to its month 2 indices. This is a little complicated, and the way I recommend doing it involves defining your own python function using the "def" keyword. A version of this I quickly whipped up is below:
#Extract columns "A & B" and "C & D" to separate month 1 and 2 dataframes respectively.
month1_data: pd.DataFrame = excelData.iloc[:, 0:2]
month2_data: pd.DataFrame = excelData.iloc[:, 2:4]
#Define a matching function to match a single row (series) with its corresponding row in a passed dataframe
def matchDataToIndex(dataLine: pd.Series, comparison: pd.DataFrame):
    # look down the first column of the comparison frame for this row's product name
    # (this raises ValueError if the product has no match in the other month)
    matchIndex = comparison.iloc[:, 0].tolist().index(dataLine.iloc[0])
    # Series.append is deprecated in recent pandas, so concatenate instead
    return pd.concat([dataLine, pd.Series([matchIndex])], ignore_index=True)
#Apply the defined matching function to each row of the month 1 data
month1_data_with_match = month1_data.apply(matchDataToIndex, axis=1, args=(month2_data,))
There is a lot here that you are probably not familiar with if you are only just getting started with Python, which is why I recommend getting acquainted with pandas. That said, after this runs, the variable month1_data_with_match will be a three-column table containing the following columns:
Your Column A product name data.
Your Column B product quantity data.
An index expressing which row in month2_data contains the matching Column C and D data.
With those three pieces of information together, you should then be able to calculate your other statistics.
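As a rough sketch of that last step (the column names product, qty_jul and qty_aug below are placeholders I chose; only the column positions come from your layout), a merge can do the INDEX/MATCH lookup and the subtraction in one go:
month1 = excelData.iloc[:, 0:2].copy()
month2 = excelData.iloc[:, 2:4].copy()
month1.columns = ["product", "qty_jul"]
month2.columns = ["product", "qty_aug"]
# left join the July quantities onto the August rows by product name (the MATCH step),
# then subtract, mirroring =G3-INDEX(B:B,MATCH(F3,A:A,0));
# assumes product names are unique within a month, and products missing in July get NaN
merged = month2.merge(month1, on="product", how="left")
merged["diff"] = merged["qty_aug"] - merged["qty_jul"]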
Related
I want to subtract or remove the words in one dataframe from another dataframe in each row.
This is the main table/columns of a pyspark dataframe.
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i want to go|
|2020-09-02|i need a line hold |
|2020-09-02|i have the 60 packs|
|2020-09-02|hello want you teach|
Below is another pyspark dataframe. The words in this dataframe need to be removed from the main table above, in column cust_text, wherever they occur in a row. For example, 'want' will be removed from every row of the first dataframe in which it shows up.
+-------+
|column1|
+-------+
| want|
|because|
| need|
| hello|
| a|
| have|
| go|
+-------+
This can be done in pyspark or pandas. I have tried googling the solution using Python, PySpark and pandas, but I am still not able to remove the words from the main table based on a single-column table.
The result should look like this:
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i to |
|2020-09-02|i line hold |
|2020-09-02|i the 60 packs |
|2020-09-02|you teach |
+----------+--------------------+
If you want to remove just the word in the corresponding line of df2, you could do that as follows, but it will probably be slow for large data sets, because it can only partially use fast C implementations:
# define a helper function that removes the string in the 'remove' column from 'cust_text'
def remove_string(ser_row):
    return ser_row['cust_text'].replace(ser_row['remove'], '')

# create a temporary column with the string to remove in the first dataframe
df1['remove'] = df2['column1']
# apply the helper row-wise; keep the result separate so df1 stays a dataframe
result = df1.apply(remove_string, axis='columns')
# drop the temporary column afterwards
df1.drop(columns=['remove'], inplace=True)
The result looks like:
Out[145]:
0 hi fine i to go
1 i need lines hold
2 i have the 60 packs
3 can you teach
dtype: object
If, however, you want to remove all the words in your df2 column from every row, you need to do it differently. Unfortunately str.replace does not help here with plain strings, unless you want to call it once for every line of your second dataframe.
So if your second dataframe is not too large, you can build one regular expression and use str.replace with it.
import re

# one alternation pattern that matches any of the words in df2 as whole words
replace = re.compile(r'\b(' + '|'.join(df2['column1']) + r')\b')
df1['cust_text'].str.replace(replace, '', regex=True)
The output is:
Out[184]:
0 hi fine i to
1 i lines hold
2 i the 60 packs
3 can you teach
Name: cust_text, dtype: object
If you don't like the repeated spaces that remain, you can do something like:
df1['cust_text'].str.replace(replace, '', regex=True).str.replace(r'\s{2,}', ' ', regex=True)
Addition: what if not only the text without the words is relevant, but also the words themselves? How can we get the words that were replaced? Here is one attempt, which works if we can identify a character that never appears in the text. Let's assume this character is #; then you could do the following (on the original column values, before any replacement):
# enclose each matched keyword in #
ser_matched = df1['cust_text'].replace({replace: r'#\1#'}, regex=True)
# now remove the rest of the line, which is unmatched
# (this is the part of the string after the last occurrence of a #)
ser_matched = ser_matched.replace({r'^(.*)#.*$': r'\1', '^#': ''}, regex=True)
# and if you would like your keywords in a list rather than a string,
# you can split the string at the end
ser_matched.str.split(r'#+')
This solution is specific to pandas. If I understand your challenge correctly, you want to remove all words from column cust_text that occur in column1 of the second DataFrame. Let's call the corresponding DataFrames df1 and df2. This is how you would do it:
for i in range(len(df1)):
    sentence = df1.loc[i, "cust_text"]
    for j in range(len(df2)):
        delete_word = df2.loc[j, "column1"]
        if delete_word in sentence:
            sentence = sentence.replace(delete_word, "")
    df1.loc[i, "cust_text"] = sentence
I have assigned variables to certain data points in these dataframes (sentence and delete_word), but that is just for the sake of understanding. You can easily condense this code to a few lines shorter by not doing that.
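For illustration, the condensed version mentioned above might look roughly like this (same logic, just without the intermediate variables; the membership check is dropped because replacing an absent word is a no-op anyway):
for i in range(len(df1)):
    for j in range(len(df2)):
        # replace() leaves the text unchanged when the word is not present
        df1.loc[i, "cust_text"] = df1.loc[i, "cust_text"].replace(df2.loc[j, "column1"], "")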
I have to exclude rows from a dataframe where the column Justification contains the word spare in it:
"Justification":"WIWYNN | MSASM Spares | 21| MDM: 2520171"
I have tried the following ways but nothing worked (I am using PySpark):
df= df.where(~ df["Justification"].like("%spares%"))
df = df.where(~(col("Justification").like("%spare%")))
df = df.where("Justification not like '%spare%'")
The results still contain rows where the Justification column has the word spare in it, even though I have applied the negation.
I want the exact opposite result.
Try the code below. like is case-sensitive, so you need to apply a lower function before comparing.
from pyspark.sql.functions import col, lower

df.where(~(lower(col("Justification")).like("%spare%")))
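An equivalent variant, if you prefer contains over a like pattern (just a sketch of the same lower() idea):
from pyspark.sql.functions import col, lower

# keep only the rows whose Justification does not mention "spare" in any casing
df_no_spare = df.where(~lower(col("Justification")).contains("spare"))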
I'm a beginner at Python and I have a school project where I need to analyze an Excel document. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains a code that we use to identify some materials. A material code looks like this -> 3A8356. There are different material codes in the same column and they repeat a lot. I want to identify them and make a list with only one of each code, no repeats. Is there a way I can analyze the column and extract the codes that repeat, so I can make a new column with only one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it to something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials'], keep='first') (keep=False would instead drop every code that appears more than once, rather than keeping one of each).
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
the subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Per the docs, a new data frame with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to re-index the result, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
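Putting those pieces together, a minimal sketch might look like this (header=11 is an assumption so that the "Materials" heading in Excel row 12 becomes the header row, and the sort just mirrors the ordering in your example):
import pandas as pd

df = pd.read_excel(path_to_file, header=11)  # assumes the headings sit in Excel row 12
materials = (df.drop_duplicates(subset=['Materials'], keep='first')
               .sort_values('Materials')
               .reset_index(drop=True))
print(materials['Materials'])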
I am new to Python and I'm trying to reproduce the result of Excel's INDEX MATCH function with Python & pandas, though I'm struggling to get it working.
Basically, I have 2 separate DataFrames:
The first DataFrame ('market') has 7 columns, though I only need 3 of those columns for this exercise ('symbol', 'date', 'close'). This df has 13,948,340 rows.
The second DataFrame ('transactions') has 14 columns, though I only need 2 of those columns ('i_symbol', 'acceptance_date'). This df has 1,428,026 rows.
My logic is: If i_symbol is equal to symbol and acceptance_date is equal to date: print symbol, date & close. This should be easy.
I have achieved it with iterrows() but because of the size of the dataset, it returns a single result every 3 minutes - which means I would have to run the script for 1,190 hours to get the final result.
Based on what I have read online, itertuples should be a faster approach, but I am currently getting an error:
ValueError: too many values to unpack (expected 2)
This is the code I have written (which currently produces the above ValueError):
for i_symbol, acceptance_date in transactions.itertuples(index=False):
    for symbol, date in market.itertuples(index=False):
        if i_symbol == symbol and acceptance_date == date:
            print(market.symbol + market.date + market.close)
2 questions:
Is itertuples() the best/fastest approach? If so, how can I get the above working?
Does anyone know a better way? Would indexing work? Should I use an external db (e.g. mysql) instead?
Thanks, Matt
Regarding question 1: pandas.itertuples() yields one namedtuple for each row. You can either unpack these like standard tuples or access the tuple elements by name:
for t in transactions.itertuples(index=False):
    for m in market.itertuples(index=False):
        if t.i_symbol == m.symbol and t.acceptance_date == m.date:
            print(m.symbol + m.date + m.close)
(I did not test this with data frames of your size but I'm pretty sure it's still painfully slow)
Regarding question 2: You can simply merge both data frames on symbol and date.
Rename your "transactions" DataFrame so that it also has columns named "symbol" and "date":
transactions = transactions[['i_symbol', 'acceptance_date']]
transactions.columns = ['symbol','date']
Then merge both DataFrames on symbol and date:
result = pd.merge(market, transactions, on=['symbol','date'])
The result DataFrame consists of one row for each symbol/date combination which exists in both DataFrames. The operation only takes a few seconds on my machine with DataFrames of your size.
#Parfait provided the best answer below as a comment. Very clean, worked incredibly fast - thank you.
pd.merge(market[['symbol', 'date', 'close']],
         transactions[['i_symbol', 'acceptance_date']],
         left_on=['symbol', 'date'],
         right_on=['i_symbol', 'acceptance_date'])
No need for looping.
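As a small follow-up (a sketch): the merge keeps both sets of key columns, so you may want to drop the duplicated ones afterwards.
result = pd.merge(market[['symbol', 'date', 'close']],
                  transactions[['i_symbol', 'acceptance_date']],
                  left_on=['symbol', 'date'],
                  right_on=['i_symbol', 'acceptance_date'])
# 'i_symbol'/'acceptance_date' duplicate 'symbol'/'date' after the join, so drop them
result = result.drop(columns=['i_symbol', 'acceptance_date'])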
I have a dataset like this:
Policy | Customer | Employee | CoverageDate | LapseDate
123    | 1234     | 1234     | 2011-06-01   | 2015-12-31
124    | 1234     | 1234     | 2016-01-01   | ?
125    | 1234     | 1234     | 2011-06-01   | 2012-01-01
124    | 5678     | 5555     | 2014-01-01   | ?
I'm trying to iterate through each policy for each employee of each customer (a customer can have many employees, an employee can have multiple policies) and compare the covered date against the lapse date for a particular employee. If the covered date and lapse date are within 5 days, I'd like to add that policy to a results list.
So, expected output would be:
Policy | Customer | Employee
123 | 1234 | 1234
because policy 123's lapse date was within 5 days of policy 124's covered date.
I'm running into a problem while trying to iterate through each grouping of Customer/Employee numbers. I'm able to identify how many rows of data are in each EmployeeID/Customer number (EBCN below) group, but I need to reference specific data within those rows to assign variables for comparison.
So far, I've been able to write this code:
import pandas
import datetime

wd = pandas.read_csv(DATASOURCE)
l = 0
for row, i in wd.groupby(['EMPID', 'EBCN']).size().iteritems():
    Covdt = pandas.to_datetime(wd.loc[l, 'CoverageEffDate'])
    for each in range(i):
        LapseDt = wd.loc[l, 'LapseDate']
        if LapseDt != '?':
            LapseDt = pandas.to_datetime(LapseDt) + datetime.timedelta(days=5)
            if Covdt < LapseDt:
                print('got one!')
        l = l + 1
This code is not working because I'm trying to reference the coverage date/lapse dates on a particular row with the loc function, with my row number stored in the 'l' variable. I initially thought that Pandas would iterate through groups in the order they appear in my dataset, so that I could simply start with l=0 (i.e. the first row in the data), assign the coverage date and lapse date variables based on that, and then move on, but it appears that Pandas starts iterating through groups randomly. As a result, I do indeed get a comparison of lapse/coverage dates, but they're not associated with the groups that end up getting output by the code.
The best solution I can figure is to determine what the row number is for the first row of each group and then iterate forward by the number of rows in that group.
I've read through a question regarding finding the first row of a group, and am able to do so by using
wd.groupby(['EMPID','EBCN']).first()
but I haven't been able to figure out what row number the results are stored on in a way that I can reference with the loc function. Is there a way to store the row number for the first row of a group in a variable or something so I can iterate my coverage date and lapse date comparison forward from there?
Regarding my general method, I've read through the question here, which is very close to what I need:
pandas computation in each group
however, I need to compare each policy in the group against each other policy in the group - the question above just compares the last row in each group against the others.
Is there a way to do what I'm attempting in Pandas/Python?
For anyone needing this information in the future - I was able to implement Boud's suggestion to use the pandas.merge_asof() function to replace my code above. I had to do some data manipulation to get the desired result:
Splitting the dataframe into two separate frames - one with CoverageDate and one with LapseDate.
Replacing the '?' (null values) in my data with a numpy.nan datatype
Sorting the left and right dataframes by the Date columns
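A rough sketch of that preparation, assuming the column names from my original code above (the exact renaming and the concatenated key are just one way to do it):
import numpy
import pandas

wd = pandas.read_csv(DATASOURCE)

# treat the '?' placeholders as missing values
wd = wd.replace('?', numpy.nan)

# concatenated key for merge_asof's 'by' argument
wd['EMP|EBCN'] = wd['EMPID'].astype(str) + '|' + wd['EBCN'].astype(str)

# one frame keyed on coverage dates, one on lapse dates, both with a common 'Date' column
cov = wd[['EMP|EBCN', 'CoverageEffDate']].rename(columns={'CoverageEffDate': 'Date'})
term = wd[['EMP|EBCN', 'LapseDate']].rename(columns={'LapseDate': 'Date'}).dropna(subset=['Date'])
cov['Date'] = pandas.to_datetime(cov['Date'])
term['Date'] = pandas.to_datetime(term['Date'])

# merge_asof requires both frames to be sorted on the 'on' column
cov = cov.sort_values('Date')
term = term.sort_values('Date')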
Once the data was in the correct format, I implemented the merge:
pandas.merge_asof(cov, term,
                  on='Date',
                  by='EMP|EBCN',
                  tolerance=pandas.Timedelta('5 days'))
Note 'cov' is my dataframe containing coverage dates, term is the dataframe with lapses. The 'EMP|EBCN' column is a concatenated column of the employee ID and Customer # fields, to allow easy use of the 'by' field.
Thanks so much to Boud for sending me down the correct path!