Adding column titles between current titles in pandas - python

I'm relatively new to coding, so this may be an easy one! Basically I'm using pandas to import data, and I want to add column headers in place of the original header titles. I've included my code with the names= section showing essentially what I would like to see. Help with how to actually implement this would be great, as I am very stuck.
dfFQExp = pd.read_csv(fileFQExp, delimiter=r'\s+', names=["Original header1", "error1", "Original header2", "error2", ...])
Thanks!

If you would like to rename the column names, you can do it this way:
By location:
dfFQExp.rename(columns={dfFQExp.columns[0]: 'new header1'}, inplace=True)
By original name:
dfFQExp.rename(columns={'Original header1': 'new header1'}, inplace=True)
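A minimal, self-contained sketch of both rename styles (the frame here is hypothetical stand-in data, not the CSV from the question):

```python
import pandas as pd

# Hypothetical stand-in for the imported CSV
dfFQExp = pd.DataFrame({"Original header1": [1.0], "error1": [0.1]})

# Rename by position
dfFQExp.rename(columns={dfFQExp.columns[0]: "new header1"}, inplace=True)

# Rename by original name
dfFQExp.rename(columns={"error1": "new error1"}, inplace=True)

print(list(dfFQExp.columns))  # ['new header1', 'new error1']
```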

Related

create variables in python with available data

I have read in data like this:
import numpy as np
arr = np.loadtxt("data/za.csv", delimiter=",")
display(arr)
Now the display looks like this:
array([[5.00e+01, 1.80e+00, 1.60e+00, 1.75e+00],
       [4.80e+01, 1.77e+00, 1.63e+00, 1.75e+00],
       [5.50e+01, 1.80e+00, 1.60e+00, 1.75e+00],
       ...,
       [5.00e+01, 1.80e+00, 1.60e+00, 1.75e+00],
       [4.80e+01, 1.77e+00, 1.63e+00, 1.75e+00],
       [5.00e+01, 1.80e+00, 1.60e+00, 1.75e+00]])
Now I would like to assign names to the columns of this array:
the first is the weight of the person
the second is the height of the person
the third is the height of the mother
the fourth is the height of the father
How can I create variables that represent the columns?
Install the pandas library, then:
import pandas as pd
Use pd.read_csv("data/za.csv", header=None, names=["weight", "height", "etc"]) to read the data (note that read_csv takes column labels through the names parameter, and header=None tells it the file has no header row).
Hope you get the solution.
As it has already been advised, you may use pandas.read_csv for the purpose, as per below:
df = pd.read_csv(**{
    'filepath_or_buffer': "data/za.csv",
    'header': None,
    'names': ('weight_of_person', 'height_of_person', 'height_of_mother', 'height_of_father'),
})
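Once the names are set, each column can be pulled out as its own variable; a small sketch with made-up numbers standing in for data/za.csv:

```python
import pandas as pd

# Made-up rows standing in for data/za.csv
df = pd.DataFrame(
    [[50.0, 1.80, 1.60, 1.75],
     [48.0, 1.77, 1.63, 1.75]],
    columns=['weight_of_person', 'height_of_person',
             'height_of_mother', 'height_of_father'],
)

# Each named column is a Series you can treat as a variable
weight_of_person = df['weight_of_person']
height_of_person = df['height_of_person']
print(weight_of_person.mean())  # 49.0
```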

Get only the name of a DataFrame - Python - Pandas

I'm currently working on an ETL project with messy data that I'm trying to get right.
For this, I'm trying to create a function that takes the names of my DFs and exports them to CSV files that will be easy for me to deal with in Power BI.
I've started with a function that will take my DFs and clean the dates:
df_liste = []

def facture(x):
    x = pd.DataFrame(x)
    for s in x.columns.values:
        if "Fact" in s:
            x.rename(columns={s: 'periode_facture'}, inplace=True)
            x['periode_facture'] = pd.to_datetime(x['periode_facture'], format='%Y%m')
If I don't cast 'x' to a DataFrame it doesn't work, but that's not my problem.
As you can see, I have set up a list variable which I would like to fill with the names of the DFs, and the names only. Unfortunately, after a lot of tries, I haven't succeeded yet, so here it is, my first question on Stack ever!
Just in case, this is the first version of the function I would like to have:
def export(x):
    for df in x:
        df.to_csv(f'{df}.csv', encoding='utf-8')
You'd have to set the name of your dataframe first using df.name (ideally when you create them / read data into them). Note that this is a plain Python attribute, not a pandas feature, so most pandas operations won't preserve it.
Then you can access the name like a normal attribute:
import pandas as pd
df = pd.DataFrame( data=[1, 2, 3])
df.name = 'my df'
and can use
df.to_csv(f'{df.name}.csv', encoding='utf-8')
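Putting it together as a sketch (the export helper below is hypothetical, and again .name is a plain attribute that operations on the frame will drop):

```python
import pandas as pd

df = pd.DataFrame(data=[1, 2, 3])
df.name = 'my df'  # plain attribute; pandas operations won't preserve it

# Hypothetical export helper built on the .name attribute
def export(dfs):
    for d in dfs:
        d.to_csv(f'{d.name}.csv', encoding='utf-8')

# The filename each frame would get:
print(f'{df.name}.csv')  # my df.csv
```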

melt some values of a column to a separate column

I have a dataframe like below:
But I want a new dataframe with the state as a separate column, as below:
Do you know how to do it using Python? Thank you so much.
If you provide an example dataset it would be helpful and we could work on it. I created an example dataset like the table below:
numbers were given randomly.
I am not sure if there is an easy way. You should put all your states in a list beforehand. The main idea behind my approach is detecting the empty rows between the states: the first string coming after an empty row is the state name, and that name is filled down until the next empty row is reached. (Since there might be another name like the country United States that also comes after an empty row, we create the states list beforehand to avoid mistakes.)
Here is my approach:
import pandas as pd
import numpy as np

data = pd.read_excel("data.xlsx")
states = ["Alabama", "Alaska", "Arizona"]
data['states'] = np.nan  # creating states column
flag = ""
for index, value in data['location'].items():
    if pd.notnull(value):
        if value in states:
            flag = value
        data.loc[index, 'states'] = flag

# relocating 'states' column to the second position in the dataframe
column = data.pop('states')
data.insert(1, 'states', column)
And the result:
Well, let's say we have this data:
data = {
    'County': [
        ' USA',
        '',
        ' Alabama',
        'Autauga',
        'Baldwin',
        'Barbour',
        '',
        ' Alaska',
        'Aleutians',
        'Anchorage',
        '',
        ' Arizona',
        'Apache',
        'Cochise',
    ]
}
df = pd.DataFrame(data)
We could use empty lines as marks of a new state like this:
spaces = (df.County == '')
states = spaces.shift().fillna(False)
df['States'] = df.loc[states, 'County']
df['States'].ffill(inplace=True)
In the code above, states is a mask of the cells under empty lines, which is where the state names are located. At the next step we connect the states by their original index to the new column. After that we apply a forward fill of NaN values, which duplicates each state until the next one.
Additionally we could do some cleaning. This, IMO, would be more relevant somewhat earlier, but anyway:
df['States'] = df['States'].fillna('')
df.loc[spaces, 'States'] = ''
df.loc[states, 'States'] = ''
This method relies on the structure having empty rows between states. Let's try something different in case there are no empty rows between states. Say we have data like this (no empty rows, no spaces around names):
data = [
    'USA',
    'Alabama',
    'Autauga',
    'Baldwin',
    'Barbour',
    'Alaska',
    'Aleutians',
    'Anchorage',
    'Arizona',
    'Apache',
    'Cochise',
]
df = pd.DataFrame(data, columns=['County'])
We can work with a list of known states and pandas.Series.isin in this case. All the other logic can stay the same:
States = ['Alabama','Alaska','Arizona',...]
mask = df.County.isin(States)
df = df.assign(**{'States':df.loc[mask, 'County']}).ffill()
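With a concrete (sample) state list filled in for the ellipsis, the isin approach runs end to end like this:

```python
import pandas as pd

df = pd.DataFrame(['USA', 'Alabama', 'Autauga', 'Baldwin', 'Barbour',
                   'Alaska', 'Aleutians', 'Anchorage',
                   'Arizona', 'Apache', 'Cochise'],
                  columns=['County'])

States = ['Alabama', 'Alaska', 'Arizona']  # sample list; extend as needed
mask = df.County.isin(States)
# Put the matched state names in a new column, then fill them downward
df = df.assign(**{'States': df.loc[mask, 'County']}).ffill()

print(df.loc[10, 'States'])  # Arizona
```

The 'USA' row keeps NaN in States since there is nothing above it to fill from.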

Pandas: count number of times every value in one column appears in another column

I want to count the number of times a value in the Child column appears in the Parent column, then display this count in a new column named child count. See the df previews below.
I had this done via VBA (COUNTIFS), but now I need dynamic visualization and an animated display with data fed from a directory. So I turned to Python and pandas and tried the code below, after searching and reading answers like: Countif in pandas with multiple conditions | Determine if value is in pandas column | Iterate over rows in Pandas df | many others...
but I still can't get the expected preview as illustrated in the image below.
Any help will be very much appreciated. Thanks in advance.
#import libraries
import pandas as pd
import numpy as np
import os
#get datasets
path_dataset = r'D:\Auto'
df_ns = pd.read_csv(os.path.join(path_dataset, 'Scripts', 'data.csv'), index_col = False, encoding = 'ISO-8859-1', engine = 'python')
#preview dataframe
df_ns
#tried
df_ns.groupby(['Child','Parent', 'Site Name']).size().reset_index(name='child count')
#preview output
df_ns.groupby(['Child','Parent', 'Site Name']).size().reset_index(name='child count')
preview dataframe
preview output
expected output
[Edited] My data
Child = ['Tkt01', 'Tkt02', 'Tkt03', 'Tkt04', 'Tkt05', 'Tkt06', 'Tkt07', 'Tkt08', 'Tkt09', 'Tkt10']
Parent = [' ', ' ', 'Tkt03', ' ', ' ', 'Tkt03', ' ', 'Tkt03', ' ', ' ', 'Tkt06', ' ', ' ', ' ']
Site_Name = ['Yaounde', 'Douala', 'Bamenda', 'Bafoussam', 'Kumba', 'Garoua', 'Maroua', 'Ngaoundere', 'Buea', 'Ebolowa']
I created a lookalike of your df.
Before
Try this code
df['Count'] = [len(df[df['parent'].str.contains(value)]) for index, value in enumerate(df['child'])]
#breaking it down as line-by-line code
counts = []
for index, value in enumerate(df['child']):
    found = df[df['parent'].str.contains(value)]
    counts.append(len(found))
df['Count'] = counts
After
Hope this works for you.
Since I don't have access to your data, I cannot check the code I am giving you. I suspect you will have problems with NaN values with this line, but you can give it a try:
df_ns['child_count'] = df_ns['Parent'].groupby(df_ns['Child']).value_counts()
I give a name to the new column and directly assign values to it through the groupby -> value_counts functions.
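As a vectorized alternative to both answers above (my own sketch, on a hypothetical slice of the data): count the Parent values once with value_counts, then map those counts onto Child:

```python
import pandas as pd

# Hypothetical slice of the data
df = pd.DataFrame({
    'Child':  ['Tkt01', 'Tkt02', 'Tkt03', 'Tkt04'],
    'Parent': [' ',     'Tkt03', ' ',     'Tkt03'],
})

# Count each Child value's occurrences in Parent; 0 where it never appears
df['child count'] = df['Child'].map(df['Parent'].value_counts()).fillna(0).astype(int)
print(df['child count'].tolist())  # [0, 0, 2, 0]
```

This avoids the per-row str.contains scan, which matters on large frames.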

removing rows with given criteria

I am a beginner with both Python and pandas and I came across an issue I can't handle on my own.
What I am trying to do is:
1) remove all the columns except the three that I am interested in
2) remove all rows which contain several strings in the column "asset number". And here is the difficult part: I removed all the blanks, but I can't remove the other ones because nothing happens (for example with the string "TECHNOLOGIES" - I tried part of the word and the whole word, and neither works).
Here is the code:
import modin.pandas as pd
File1 = 'abi.xlsx'
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19')
df = df[['asset number','Cost','accumulated depr']] #removing other columns
df = df.dropna(axis=0, how='any', thresh=None, subset=None, inplace = False)
df = df[~df['asset number'].str.contains("TECHNOLOGIES, INC", na=False)]
df.to_excel("abi_output.xlsx")
And besides that, the file has 600k rows, so loading it to see the output is very slow. Do you have any advice for that?
Thank you!
@Kenan - thank you for your answer. The code now looks like below, but it still doesn't remove the rows whose chosen column contains the specified strings. I also attached a screenshot of the output to show you that the rows still exist. Any thoughts?
import modin.pandas as pd
File1 = 'abi.xlsx'
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19', usecols=['asset number','Cost','accumulated depr'])
several_strings = ['', 'TECHNOLOGIES', 'COST CENTER', 'Account', '/16']
df = df[~df['asset number'].isin(several_strings)]
df.to_excel("abi_output.xlsx")
rows still are not deleted
@Andy
I attach a sample of the input file. I just changed the numbers in two columns because these are confidential, and removed the columns that aren't needed (removing them with code wasn't a problem).
Here is the link. Let me know if it is not working properly.
You can combine your first two steps with:
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19', usecols=['asset number','Cost','accumulated depr'])
I assume this is what you're trying to remove:
several_strings = ['TECHNOLOGIES, INC', 'blah', 'blah']
df = df[~df['asset number'].isin(several_strings)]
df.to_excel("abi_output.xlsx")
Update
Based on the link you provided this might be a better approach
df = df[df['asset number'].str.len().eq(7)]
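One caveat worth noting: isin only matches whole cell values, so it won't catch 'TECHNOLOGIES' inside a longer string. If the unwanted strings are substrings, a joined-regex str.contains filter is an option (the sample data below is made up):

```python
import pandas as pd

# Made-up sample of the 'asset number' column
df = pd.DataFrame({'asset number': ['1234567', 'ABC TECHNOLOGIES, INC',
                                    'COST CENTER 9', '7654321']})

# Join the unwanted substrings into one regex and drop matching rows
several_strings = ['TECHNOLOGIES', 'COST CENTER']
df = df[~df['asset number'].str.contains('|'.join(several_strings), na=False)]
print(df['asset number'].tolist())  # ['1234567', '7654321']
```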
The code you've given is correct, so I guess maybe there is something wrong with the strings in your 'asset number' column. Can you give some examples for a code check?
