Removing rows with given criteria - Python

I am a beginner with both Python and pandas and I came across an issue I can't handle on my own.
What I am trying to do is:
1) remove all the columns except the three I am interested in
2) remove all rows which contain several strings in the column "asset number". And here is the difficult part: I removed all the blanks, but I can't remove the other ones because nothing happens (example with the string "TECHNOLOGIES" - I tried part of the word and the whole word, and neither works).
Here is the code:
import modin.pandas as pd
File1 = 'abi.xlsx'
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19')
df = df[['asset number','Cost','accumulated depr']] #removing other columns
df = df.dropna(axis=0, how='any', thresh=None, subset=None, inplace = False)
df = df[~df['asset number'].str.contains("TECHNOLOGIES, INC", na=False)]
df.to_excel("abi_output.xlsx")
And besides that, the file has 600k rows and it loads so slowly that it takes a long time to see the output. Do you have any advice for that?
Thank you!
@Kenan - thank you for your answer. Now the code looks like below, but it still doesn't remove the rows whose chosen column contains the specified strings. I also attached a screenshot of the output to show you that the rows still exist. Any thoughts?
import modin.pandas as pd
File1 = 'abi.xlsx'
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19', usecols=['asset number','Cost','accumulated depr'])
several_strings = ['', 'TECHNOLOGIES', 'COST CENTER', 'Account', '/16']
df = df[~df['asset number'].isin(several_strings)]
df.to_excel("abi_output.xlsx")
The rows are still not deleted.
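For what it's worth, the difference between the two attempts comes down to isin() testing exact equality while str.contains() tests substrings. A minimal sketch with made-up values:

```python
import pandas as pd

# Toy data: "asset number" mixes real IDs with header junk rows.
df = pd.DataFrame({"asset number": ["1234567", "ABC TECHNOLOGIES, INC",
                                    "COST CENTER 42", "7654321"]})

# isin() only removes exact matches, so "COST CENTER 42" survives a
# filter on "COST CENTER". str.contains() matches substrings instead.
exact = df[~df["asset number"].isin(["COST CENTER"])]
substr = df[~df["asset number"].str.contains("COST CENTER", na=False)]

print(len(exact))   # 4 -- no cell is exactly equal to "COST CENTER"
print(len(substr))  # 3 -- the "COST CENTER 42" row is gone
```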
@Andy
I attached a sample of the input file. I just changed the numbers in two columns because they are confidential, and removed the columns that weren't needed (removing them with code wasn't a problem).
Here is the link. Let me know if this is not working properly.

You can combine your first two steps with:
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19', usecols=['asset number','Cost','accumulated depr'])
I assume this is what you're trying to remove:
several_strings = ['TECHNOLOGIES, INC','blah','blah']
df = df[~df['asset number'].isin(several_strings)]
df.to_excel("abi_output.xlsx")
Update
Based on the link you provided this might be a better approach
df = df[df['asset number'].str.len().eq(7)]
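If a fixed length doesn't fit your data, another common approach (sketched here with made-up values) is to join the unwanted fragments into one regex pattern and drop rows with a single str.contains() call:

```python
import pandas as pd

df = pd.DataFrame({"asset number": ["1234567", "XYZ TECHNOLOGIES",
                                    "COST CENTER", "Account",
                                    "12/16", "7654321"]})

# Join the unwanted fragments with "|" so one str.contains() call
# drops any row containing any of them. Plain text fragments are
# fine; fragments with regex metacharacters would need re.escape().
bad = ["TECHNOLOGIES", "COST CENTER", "Account", "/16"]
pattern = "|".join(bad)
clean = df[~df["asset number"].str.contains(pattern, na=False)]

print(clean["asset number"].tolist())  # ['1234567', '7654321']
```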

The code you have given is correct, so I guess there may be something wrong with the strings in your 'asset number' column. Can you give some examples for a code check?

Related

How can I merge the numerous data of two columns within the same DataFrame?

Here is a pic of df1 = fatalities.
So, in order to create a diagram that displays the years with the most injuries (I have an assignment about plane crash incidents in Greece from 2000-2020), I need to create a column out of the minor_injuries and serious_injuries ones.
I had a first df with more data, but I tried to keep only the columns that I needed, so we have the fatalities df1, which contains the years, the fatal_injuries, the minor_injuries, the serious_injuries and the total number of incidents per year (all_incidents). What I wish to do is merge the minor and serious injuries into a column named total_injuries or just injuries.
import pandas as pd
pd.set_option('display.max_rows', None)
df = pd.read_csv('all_incidents_cleaned.csv')
df.head()
df['Year'] = pd.to_datetime(df.incident_date).dt.year
fatalities = df.groupby('Year').fatalities.value_counts().unstack().reset_index()
fatalities['all_incidents'] = fatalities[['Θανάσιμος τραυματισμός', 'Μικρός τραυματισμός', 'Σοβαρός τραυματισμός', 'Χωρίς Τραυματισμό']].sum(axis=1)
df['percentage_deaths_to_all_incidents'] = round((fatalities['Θανάσιμος τραυματισμός'] / fatalities['all_incidents']) * 100, 1)
df1 = fatalities
fatalities_pd = pd.DataFrame(fatalities)
df1
fatalities_pd.rename(columns={'Θανάσιμος τραυματισμός': 'fatal_injuries', 'Μικρός τραυματισμός': 'minor_injuries', 'Σοβαρός τραυματισμός': 'serious_injuries', 'Χωρίς Τραυματισμό': 'no_injuries'}, inplace=True)
df1
For your current dataset two steps are needed.
First I would replace the NaN values with 0. Note that fillna() returns a new frame, so keep the result (or pass inplace=True):
df1 = df1.fillna(0)
Then you can create a new column "total_injuries" with the sum of minor and serious injuries:
df1["total_injuries"] = df1["minor_injuries"] + df1["serious_injuries"]
It's always a good idea to check your data for consistency before working on it. Helpful commands look like:
data.shape
data.info()
data.isna().values.any()
data.duplicated().values.any()
duplicated_rows = data[data.duplicated()]
len(duplicated_rows)
data.describe()
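Putting the two steps together on a tiny made-up frame (column names taken from the question):

```python
import pandas as pd
import numpy as np

# Hypothetical slice of the fatalities frame, with some NaN gaps.
df1 = pd.DataFrame({
    "minor_injuries": [2.0, np.nan, 1.0],
    "serious_injuries": [np.nan, 3.0, 1.0],
})

# Step 1: replace NaN with 0 (fillna returns a new frame).
df1 = df1.fillna(0)
# Step 2: sum the two columns into the new total_injuries column.
df1["total_injuries"] = df1["minor_injuries"] + df1["serious_injuries"]

print(df1["total_injuries"].tolist())  # [2.0, 3.0, 2.0]
```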

Adding column titles between current titles in pandas

I'm relatively new to coding, so this may be an easy answer! Basically I'm using pandas to import data and I want to add a column header between the original header titles. I've added the code, with the names= section showing essentially what I would like to see. Help with how that is actually implemented would be a great help, as I am very stuck.
dfFQExp = pd.read_csv(fileFQExp, delimiter='\s+', names=["Original header1", "error1", "Original header2", "error2"....])
Thanks!
If you would like to rename the column names, you can do it this way:
By location:
dfFQExp.rename(columns={ dfFQExp.columns[0]: 'new header1'}, inplace = True)
By original name:
dfFQExp.rename(columns={ 'Original header1': 'new header1'}, inplace = True)
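Both variants on a small made-up frame standing in for dfFQExp:

```python
import pandas as pd

# Hypothetical frame with the original header titles from the question.
dfFQExp = pd.DataFrame(columns=["Original header1", "error1",
                                "Original header2", "error2"])

# By location: look the current name up through .columns.
dfFQExp.rename(columns={dfFQExp.columns[0]: "new header1"}, inplace=True)
# By original name:
dfFQExp.rename(columns={"Original header2": "new header2"}, inplace=True)

print(list(dfFQExp.columns))
# ['new header1', 'error1', 'new header2', 'error2']
```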

Using pandas to categories text data in one column and have corresponding categories stated in the next column

My Excel spreadsheet currently looks like this after inserting the new column "Expense" by using the code:
import pandas as pd
df = pd.read_csv(r"C:\Users\Mihir Patel\Project\Excel & CSV Stuff\June '20 CSVData.csv")
df.insert(2, "Expense", " ")
df.to_excel(r"C:\Users\Mihir Patel\Project\Excel & CSV Stuff\June '20 CSVData.xlsx", index=None, header=True)
So because the Description column contains the word "DRAKES", I can categorise that expense as "Personal", which should appear in the Expense column next to it.
Similarly, the next one down contains "Optus", which is a mobile-related expense, so the word "Phone" should appear in the Expense column.
I have tried searching on Google and YouTube but I just can't seem to find an example for something like this.
Thanks for your help.
You can define a function which has all these rules and simply apply it. For ex.
def rules(x):
    if "DRAKES" in x["Description"]:
        return "Personal"
    if "OPTUS" in x["Description"]:
        return "Mobile"
    return ""
df["Expense"] = df.apply(rules, axis=1)
I have solved my problem by using a while loop. I tried to use the method in quest's answer, but I most likely didn't use it properly and kept getting an error. So I used a while loop to search through each individual cell in the "Description" column and categorise it in the same row of the "Expenses" column.
My solution using a while loop:
import pandas as pd
df = pd.read_csv("C:\\Users\\Mihir Patel\\PycharmProjects\\pythonProject\\June '20 CSVData.csv")
df.insert(2, "Expenses", "")
description = "Description"
expense = "Expenses"
transfer = "Transfer"
i = -1 #Because I wanted python to start searching from index 0
while i < 296: #296 is the row where my data ends
i = i + 1
if "Drakes".upper() in df.loc[i, description]:
df.loc[i, expense] = "Personal"
if "Optus".upper() in df.loc[i, description]:
df.loc[i, expense] = "Phone"
df.sort_values(by=["Expenses"], inplace=True)
df.to_excel("C:\\Users\\Mihir Patel\\PycharmProjects\\pythonProject\\June '20 CSVData.xlsx", index=False)
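For reference, the same categorisation can be done without a loop by building one condition per keyword and letting numpy.select pick the matching label. This is a sketch with made-up descriptions:

```python
import pandas as pd
import numpy as np

# Hypothetical transactions; only the Description column matters here.
df = pd.DataFrame({"Description": ["DRAKES SUPERMARKET",
                                   "OPTUS PREPAID",
                                   "RENT"]})

# np.select maps each matching condition to its label in one pass;
# case=False makes the match case-insensitive, na=False treats
# missing descriptions as non-matches.
conditions = [
    df["Description"].str.contains("DRAKES", case=False, na=False),
    df["Description"].str.contains("OPTUS", case=False, na=False),
]
choices = ["Personal", "Phone"]
df["Expenses"] = np.select(conditions, choices, default="")

print(df["Expenses"].tolist())  # ['Personal', 'Phone', '']
```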

Pandas: count number of times every value in one column appears in another column

I want to count the number of times a value in the Child column appears in the Parent column, then display this count in a new column renamed "child count". See the preview dfs below.
I have this done via VBA (COUNTIFS) but now need dynamic visualization and animated display with data fed from a dir. So I resorted to Python and Pandas and tried below code after searching and reading answers like: Countif in pandas with multiple conditions | Determine if value is in pandas column | Iterate over rows in Pandas df | many others...
but still can't get the expected preview as illustrated in image below.
Any help will be very much appreciated. Thanks in advance.
#import libraries
import pandas as pd
import numpy as np
import os
#get datasets
path_dataset = r'D:\Auto'
df_ns = pd.read_csv(os.path.join(path_dataset, 'Scripts', 'data.csv'), index_col = False, encoding = 'ISO-8859-1', engine = 'python')
#preview dataframe
df_ns
#tried
df_ns.groupby(['Child','Parent', 'Site Name']).size().reset_index(name='child count')
#preview output
df_ns.groupby(['Child','Parent', 'Site Name']).size().reset_index(name='child count')
preview dataframe
preview output
expected output
[Edited] My data
Child = ['Tkt01', 'Tkt02', 'Tkt03', 'Tkt04', 'Tkt05', 'Tkt06', 'Tkt07', 'Tkt08', 'Tkt09', 'Tkt10']
Parent = [' ', ' ', 'Tkt03', ' ', ' ', 'Tkt03', ' ', 'Tkt03', ' ', ' ', 'Tkt06', ' ', ' ', ' ']
Site_Name = ['Yaounde', 'Douala', 'Bamenda', 'Bafoussam', 'Kumba', 'Garoua', 'Maroua', 'Ngaoundere', 'Buea', 'Ebolowa']
I created a lookalike of your df.
Before
Try this code
df['Count'] = [len(df[df['parent'].str.contains(value)]) for index, value in enumerate(df['child'])]
#breaking it down as a line by line code
counts = []
for index, value in enumerate(df['child']):
found = df[df['parent'].str.contains(value)]
counts.append(len(found))
df['Count'] = counts
After
Hope this works for you.
Since I don't have access to your data, I cannot check the code I am giving you. I suspect you will have problems with NaN values with this line, but you can give it a try:
df_ns['child_count'] = df_ns['Parent'].groupby(df_ns['Child']).value_counts()
I give a name to the new column and directly assign values to it through the groupby -> value_counts functions.
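A sketch of the counting idea on a made-up version of the ticket data: value_counts() tallies each value in Parent, and map() carries those counts back onto Child:

```python
import pandas as pd

# Toy version of the ticket data from the question.
df = pd.DataFrame({
    "Child":  ["Tkt01", "Tkt02", "Tkt03", "Tkt04", "Tkt05", "Tkt06"],
    "Parent": ["",      "",      "Tkt03", "",      "Tkt03", "Tkt03"],
})

# value_counts() counts occurrences of each Parent value; map() looks
# each Child up in that tally; fillna(0) covers children that never
# appear as a parent.
df["child count"] = (df["Child"]
                     .map(df["Parent"].value_counts())
                     .fillna(0)
                     .astype(int))

print(df["child count"].tolist())  # [0, 0, 3, 0, 0, 0]
```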

Manipulating columns in txt file using Python

I have been trying to create a script for this task for a while now, but whatever direction I go in, I always find a dead end, so here I am looking for help. Even though this may seem like a simple task, I'm fairly new to Python and the way everything works, so any help would be greatly appreciated.
File Data
In this picture we can see five labelled columns. The idea of the script is to sum the 'Units' column and, as well as this, multiply the 'Units' column by the 'Dealer Price' to give us a revenue. I also want to group this by 'Consumer Country' and 'Currency Code'.
I have written an SQL query to help:
SELECT SUM(Units * Dealer_Price),
       SUM(Units),
       Consumer_Country,
       Currency_Code
FROM Sales_File
GROUP BY Consumer_Country, Currency_Code
I have this so far (thanks to @ParvBanks and @Martin Frodl):
import pandas
df = pandas.read_csv('data.csv', encoding='utf-8', sep='\t')
df['Revenue'] = df['Units'] * df['Dealer Price']
df = df.groupby(['Consumer Country', 'Currency Code']).sum()
df = df[['Revenue', 'Units']]
Any help would be greatly appreciated :)
import pandas as pd
df = pd.read_csv('data.csv')
df['Revenue'] = df['Units'] * df['Dealer Price']
df = df.groupby(['Consumer Country', 'Currency Code']).sum()
df = df[['Revenue', 'Units']]
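A quick check of the grouped result on a few made-up rows (column names as described in the question):

```python
import pandas as pd

# Hypothetical rows matching the columns described in the question.
df = pd.DataFrame({
    "Consumer Country": ["UK", "UK", "FR"],
    "Currency Code":    ["GBP", "GBP", "EUR"],
    "Units":            [2, 3, 1],
    "Dealer Price":     [10.0, 10.0, 5.0],
})

# Revenue per row, then sum Revenue and Units per country/currency pair.
df["Revenue"] = df["Units"] * df["Dealer Price"]
summary = df.groupby(["Consumer Country", "Currency Code"])[["Revenue", "Units"]].sum()

print(summary.loc[("UK", "GBP"), "Revenue"])  # 50.0
print(summary.loc[("UK", "GBP"), "Units"])    # 5
```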
