Python [Pandas/docx]: Merging two rows based on common name

I am trying to write a script in Python 3 using python-docx and pandas that performs the following actions:
Take input from csv file
Merge common value of Column C and add each value into docx
Export docx
My raw csv is as below:
SN. Name Instance Severity
1 Name1 file1:line1 CRITICAL
2 Name1 file2:line3 CRITICAL
3 Name2 file1:line1 Low
4 Name2 file1:line3 Low
and so on...
and I want my docx output as:
(desired docx layout shown in the linked screenshot: https://i.stack.imgur.com/1xNc0.png)
I am not able to figure out how I can filter "Instances" based on "Name" using pandas and later print them into the docx.
Thanks in advance.

The code below selects the relevant columns, groups by 'Name' and 'Severity', and joins the Instances together:
df2 = df[["Name", "Instance", "Severity"]].copy()  # copy so the assignment below doesn't raise SettingWithCopyWarning
df2["Instance"] = df2.groupby(["Name", "Severity"])["Instance"].transform(lambda x: "\n".join(x))
Finally, remove the duplicates and transpose to get the desired output:
df2 = df2.drop_duplicates()
df2 = df2.T
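To actually produce the Word document, a minimal sketch using python-docx could look like this (the three-column table layout is an assumption based on the linked screenshot, and it uses df2 as built above, before the .T transpose):
from docx import Document

doc = Document()
table = doc.add_table(rows=1, cols=3)
table.style = "Table Grid"

# Header row
for cell, heading in zip(table.rows[0].cells, ["Name", "Instance", "Severity"]):
    cell.text = heading

# One row per merged group; the newline-joined instances stack inside one cell.
for _, row in df2.iterrows():
    cells = table.add_row().cells
    cells[0].text = str(row["Name"])
    cells[1].text = str(row["Instance"])
    cells[2].text = str(row["Severity"])

doc.save("output.docx")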

Related

How to arrange data in dataframes according to headers in Python?

I am using Python 3.9 on Spyder. I receive dataframes from a source where I cannot control how the data arrives; however, I know the data is grouped under certain headers. When I try to group the data using pandas, it fails. Below is an example of the received dataframes, followed by how I want the data to be arranged.
Any ideas on how I can achieve this? Note that I have a very large amount of data so I am searching for a method with reduced memory usage.
Edit: I had a typo in name and age, I also added that the headers are different than name and age such as column1 and column2.
Assuming your main DataFrame is in the usual variable df:
# Create a copy to read from while we overwrite df in place
df2 = df.copy()
# Where the Age field actually holds a name (the part after '=' is non-numeric),
# move that value into the Name column.
name_in_age = df2["Age"].str.match(r"^\w+=\D+$")
df.loc[name_in_age, "Name"] = df2.loc[name_in_age, "Age"]
# Do the opposite for the other field: where Name holds a numeric age.
age_in_name = df2["Name"].str.match(r"^\w+=\d+$")
df.loc[age_in_name, "Age"] = df2.loc[age_in_name, "Name"]
Output of df:
Name Age
0 Age=John Age=25
1 Name=Roy Age=36
2 Name=Smith Age=19
3 Name=Donald Age=12
4 Name=jason Age=57
5 Name=joe Age=1
If every value begins with Name=... or Age=..., maybe a simple .transform() will help: alphabetically, Age=... sorts before Name=..., so sorting each row puts the age first and the name second.
df.loc[:, ["Name", "Age"]] = df.loc[:, ["Age", "Name"]].transform(
sorted, axis=1
)
print(df)
Prints:
Name Age
0 Name=John Age=25
1 Name=Roy Age=36
2 Name=Smith Age=19
3 Name=Donald Age=12
4 Name=jason Age=57
5 Name=joe Age=1
P.S.: I'm assuming the first row should be Name=John, not Age=John (but the code is the same either way).
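For reference, a self-contained sketch of the sorted-transform approach, using made-up sample rows shaped like the question's data:
import pandas as pd

# Hypothetical sample: some values landed in the wrong column.
df = pd.DataFrame({
    "Name": ["Age=25", "Name=Roy", "Name=Smith"],
    "Age": ["Name=John", "Age=36", "Age=19"],
})

# Sorting each row puts "Age=..." before "Name=...";
# the .loc assignment then aligns the sorted values back by column label.
df.loc[:, ["Name", "Age"]] = df.loc[:, ["Age", "Name"]].transform(sorted, axis=1)
print(df)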

Join 2 text file data based on a column value

How do I join two text files using Python and output a third file, adding only the values from one file that have a corresponding matching value in the second file?
Input File1.txt:
GameScore|KeyNumber|Order
85|2568909|2|
84|2672828|1|
80|2689999|5|
65|2123232|3|
Input File2.txt:
KeyName|RecordNumber
Andy|2568909|
John|2672828|
Andy|2672828|
Megan|1000021|
Required Output File3.txt:
KeyName|KeyNumber|GameScore|Order
Andy|2672828|84|1|
Andy|2568909|85|2|
John|2672828|84|1|
Megan|1000021||
Look up each key name and record number in File 2, match the record number against KeyNumber in File 1, and copy over the corresponding game score and order values.
The files can have anywhere from 1 to 500,000 records, so this needs to work on a large set.
Edit: I do not have access to any libraries like pandas and am not allowed to install any.
Essentially I need to run a cmd that triggers a program that reads the 2 files, compares them, and generates the third file.
You can use pandas to do this:
import pandas as pd

# index_col=False stops pandas from treating the first column as the index,
# which it otherwise does because each data row ends with a trailing '|'.
df1 = pd.read_csv('Input File1.txt', sep='|', index_col=False)
df2 = pd.read_csv('Input File2.txt', sep='|', index_col=False, header=0, names=['KeyName', 'KeyNumber'])

# A right join keeps every File2 row, including keys with no match (e.g. Megan).
df3 = df1.merge(df2, on='KeyNumber', how='right')
See the documentation for fine-tuning.
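Since the edit rules out pandas, here is a rough stdlib-only sketch of the same idea: build a dict from File 1, then stream File 2 and look each key up. Dict lookups are O(1), so 500,000 records is no problem. The file names and the trailing-'|' handling are assumptions from the samples:
# Build a KeyNumber -> (GameScore, Order) lookup from File 1.
scores = {}
with open("File1.txt") as f1:
    next(f1)  # skip the header line
    for line in f1:
        parts = line.rstrip("|\n").split("|")
        if len(parts) >= 3:
            game_score, key_number, order = parts[0], parts[1], parts[2]
            scores[key_number] = (game_score, order)

# Stream File 2 and write one output row per record (right-join style).
with open("File2.txt") as f2, open("File3.txt", "w") as out:
    next(f2)  # skip the header line
    out.write("KeyName|KeyNumber|GameScore|Order\n")
    for line in f2:
        parts = line.rstrip("|\n").split("|")
        if len(parts) < 2:
            continue
        key_name, key_number = parts[0], parts[1]
        game_score, order = scores.get(key_number, ("", ""))
        out.write(f"{key_name}|{key_number}|{game_score}|{order}|\n")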

How to divide a pandas dataframe into different dataframes based on unique values from one column and iterate over them?

I have a dataframe with three columns.
The first column has 3 unique values. I used the code below to create a separate dataframe per unique value; however, I am not sure how to iterate over those dataframes.
import pandas as pd

df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:, 0].unique())  # let's assume the unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:, 0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
For example, let's say I want to find the length of the first unique dataframe. If I manually type the name of the DF, I get the correct output:
len(df0)
O/P
35
But I am trying to automate the code, so I want to get the length and iterate over each dataframe the same way I would by typing its name.
What I'm looking for: if I try the code below,
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, where the key is the unique group and the value contains the matching rows, but I can't figure out how to iterate over the dictionary when the DF has more than two columns.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Let's assume there are 100 rows of data.
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A tickets should go to one group; for each description a unique task has to be created, but we can club 10 tasks into one request. So if I divide the df into different dfs based on Assignment_group, it would be easier to iterate over them (that's the only idea I could think of).
For example, let's say we have REQUEST001; within that request there will be multiple sub-tasks such as STASK001, STASK002 ... STASK010.
Hope this helps.
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can run all kinds of operations (sum, count, std, etc.) on the remaining columns, like getting the mean price for each group if price were a column.
I think you want to try something like len(eval('df%s' % 0)), though eval is generally discouraged; the groupby and dictionary approaches avoid it.
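For the iteration itself, a rough sketch without generated variable names: iterating over the groupby object yields each sub-dataframe directly, and slicing it in steps of 10 models the one-request-per-10-tasks idea from the question (column names are taken from the question's code; the request/task creation is left as a placeholder):
import pandas as pd

df = pd.read_excel("input.xlsx")

for group_name, group_df in df.groupby("Assignment Group"):
    print(group_name, len(group_df))  # e.g. "Group A" and its row count
    # Club up to 10 tasks into one request.
    for start in range(0, len(group_df), 10):
        batch = group_df.iloc[start:start + 10]
        # ... create one REQUEST here, then one sub-task (STASK) per row:
        for _, row in batch.iterrows():
            print("  task:", row["Description"], row["Document"])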

scientific notation in pandas

I have two CSV files; when I read and print them, the output looks like:
tweet_id id
312498024964313000 3.430000e+17
312278640362659000 3.430000e+17
I want both the tweet_id and id columns in the same format; the required sample output is:
tweet_id id
3.124980e+17 3.430660e+17
3.122790e+17 3.430880e+17
Please tell me how to solve this problem.
I later use both of these columns to merge two CSV files.
You can set the float format with pd.set_option; just cast both columns to float first:
pd.set_option('display.float_format', '{:.6g}'.format)
df = df.astype(float)  # assign back so the conversion persists
print(df)
tweet_id id
0 3.12498e+17 3.43e+17
1 3.12279e+17 3.43e+17
Note: Your expected output for id doesn't seem to match your input. The above result is based on the sample input provided.
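One caution, since you mention merging on these columns later: float64 only represents integers exactly up to 2**53 (about 9e15), so 17-digit IDs lose precision as floats. For the merge itself it is likely safer to keep the IDs as integers and use the display option above purely for presentation. A sketch, with file and column names assumed from the question:
import pandas as pd

df1 = pd.read_csv('tweets1.csv')   # hypothetical file names
df2 = pd.read_csv('tweets2.csv')

# Merge on exact integer IDs rather than on floats rounded for display.
df1['tweet_id'] = df1['tweet_id'].astype('int64')
df2['id'] = df2['id'].astype('int64')
merged = df1.merge(df2, left_on='tweet_id', right_on='id')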
I think you have to combine them into one dataframe first; you can tell because your printed output at the bottom also starts at a 0 index.
Use the .join method to join them, then print the result:
import pandas as pd

data_frame1 = pd.DataFrame(data=['a', 'b', 'c'], columns=['Alphabit'])
data_frame2 = pd.DataFrame(data=[1, 2, 3], columns=['Numbers'])
print(data_frame1.join(data_frame2))
Edit: sorry, I think I misinterpreted your original question.

How do I extract variables that repeat from an Excel Column using Python?

I'm a beginner at Python and I have a school project where I need to analyze an Excel document. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains codes we use to identify materials; a material code looks like this -> 3A8356. Different material codes appear in the same column and they repeat a lot. I want to identify them and build a list with one of each code, no repeats. Is there a way I can analyze the column, extract the codes that repeat, and make a new column with only one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it to something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials'], keep='first'). (keep='first' keeps one row per code; keep=False would drop every code that repeats, which is not what you want here.)
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
The subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Per the docs, the new dataframe with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to re-index the first column, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
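Putting it together, a minimal end-to-end sketch (the file name is an assumption; header=11 treats spreadsheet row 12 as the header, so the data starts at row 13 as in the question):
import pandas as pd

# Read only column B; row 12 holds the "Materials" header.
df = pd.read_excel("materials.xlsx", usecols="B", header=11)

unique_materials = (
    df.drop_duplicates(subset=["Materials"], keep="first")
      .sort_values("Materials")
      .reset_index(drop=True)
)
print(unique_materials)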
