Pandas drop first columns after csv read - python

Is there a way to reference an object within the line that instantiates it?
See the following example:
I wanted to drop the first column (by index) of a CSV file right after reading it (DataFrame.to_csv usually writes the index as the first column):
df = pd.read_csv(csvfile).drop(self.columns[[0]], axis=1)
I understand self belongs in an object context; here it only describes what I intend to do.
(Of course, doing this operation in two separate lines works perfectly.)

One way is to use pd.DataFrame.iloc:
import pandas as pd
from io import StringIO
mystr = StringIO("""col1,col2,col3
a,b,c
d,e,f
g,h,i
""")
df = pd.read_csv(mystr).iloc[:, 1:]
#   col2 col3
# 0    b    c
# 1    e    f
# 2    h    i

Assuming you know the total number of columns in the dataset and the indexes you want to remove:
a = list(range(3))  # range() has no .remove() in Python 3, so build a list first
a.remove(1)
df = pd.read_csv('test.csv', usecols=a)
Here 3 is the total number of columns, and I wanted to remove the 2nd column (index 1). You can also pass the list of column indexes to keep directly.
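Coming back to the original question: if the first column is just the index that DataFrame.to_csv wrote out, a minimal sketch (the file name here is an assumption) is to read it back in as the index, which avoids the drop entirely:
import pandas as pd
# treat the first column as the index so it never becomes a data column
df = pd.read_csv('test.csv', index_col=0)
# optionally discard that index afterwards
df = df.reset_index(drop=True)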

Related

How to extract inside of column to several columns

I have an Excel file imported into a DataFrame, and I want to split the contents of one column into several columns.
Here is the original data.
After importing it into pandas in Python, I get this data with '\n' inside the column.
So, I want to extract the contents of that column. Could you share an idea or some code?
My expected columns are....
Don't worry, no one is born knowing everything about SO. Considering the data you gave, especially that 'Vector:...' is not separated by '\n', the following works:
import pandas as pd
import numpy as np

data = pd.read_excel("the_data.xlsx")
ok = []
l = len(data['Details'])
for n in range(l):
    x = data['Details'][n].split()
    x[2] = x[2].replace('Vector:', '')  # drop the 'Vector:' prefix
    x = [v for v in x if v not in ['Type:', 'Mission:']]
    ok += x
values = np.array(ok).reshape(l, 3)
df = pd.DataFrame(values, columns=['Type', 'Vector', 'Mission'])
data.drop('Details', axis=1, inplace=True)
final = pd.concat([data, df], axis=1)
The process goes like this:
First you split each element of the Details column into a list of strings. Second you deal with the 'Vector:....' special case and filter out the label strings. Third you store all the values in a list, which is in turn converted to a numpy array of shape (length, 3). Finally you drop the old 'Details' column and concatenate with the DataFrame created from the split strings.
You may want to transform your data more efficiently at read time by applying these ideas inside pd.read_excel via its converters argument.
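For instance, here is a rough sketch of that converters idea, assuming the 'Details' cells look like 'Type: X Vector:Y Mission: Z' as in the answer above (the parsing is tied to that layout):
import pandas as pd

def split_details(cell):
    # split one 'Details' cell into its three values while reading
    parts = cell.split()
    parts[2] = parts[2].replace('Vector:', '')
    return [p for p in parts if p not in ('Type:', 'Mission:')]

data = pd.read_excel("the_data.xlsx", converters={'Details': split_details})
# each 'Details' cell is now a 3-element list; expand it into real columns
data[['Type', 'Vector', 'Mission']] = pd.DataFrame(data['Details'].tolist(), index=data.index)
data = data.drop(columns='Details')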

Pandas python help - can't seem to get the code to do what I need it to

I have to write a script that reads a CSV file, drops the columns whose names contain a '.', and reads the strings False and True as 0 and 1. I have been able to code the column dropping fine, but I'm having trouble getting pandas to read False as 0 and True as 1 so that the entire dataset can be treated as numbers. All the other data is float, so I believe I must iterate to find False and True and have them read as 0 and 1. Below is the code I have so far, and I have attached a sample of the data.
import numpy as np
import pandas as pd
def ImportCustomScript(filename):
    data = pd.read_csv(filename, sep=',', header=None)
    cols = data.shape[1]
    data_list = []
    for i in range(cols):
        if i == '.' in data.columns:
            data.drop([i], axis=1)
        data_list.append([data[i][0], np.array(data[cols][1:], dtype='<f8')])
    data.replace('False', 0).replace('True', 1)
    data_frame = pd.DataFrame(data_list)
    return data_frame
You could perhaps do this, though more info on your actual data would be useful:
print(df)
     col  col2  col3.
0   True     5      6
1  False     6      3
2   True    32      5
3  False     3      9
df = df[df.columns[~df.columns.str.contains(r'\.')]]
df['col'] = df['col'].astype(int) #or df.loc[:,'col'] = df['col'].astype(int)
The best way to achieve both tasks is to mask the columns according to the properties you want.
For the first part, assuming data is your DataFrame, you can mask out (using ~) the columns containing a dot:
data = data.loc[:, ~data.columns.str.contains(r"\.")]
For the second part, as pd.read_csv already recognizes boolean columns, you can convert only the boolean columns to int using pd.DataFrame.astype:
data = data.astype({col: int for col in data.columns[data.dtypes == bool]})
EDIT
If for some reason pd.read_csv does not automatically convert the "True" and "False" strings into booleans, you can use the following solution to first recognize the columns containing only these strings and then replace them with your integer codes:
for col in data.columns[data.dtypes == object]:
    if data[col].str.match("^(True|False)$").all():
        data[col].replace({"True": 1, "False": 0}, inplace=True)
EDIT 2
According to the image you attached, the strings "True" and "False" are mixed in with other values. My previous answer instead covers the case where every value in a column is one of these strings.
Therefore, to achieve the result you want you can replace all values as follows:
def ImportCustomScript(filename):
    # read the csv
    df = pd.read_csv(filename, sep=',')
    # remove the columns whose name contains a dot
    df = df.loc[:, ~df.columns.str.contains(r"\.")]
    # replace all "True"/"False" strings
    df.replace({"True": 1, "False": 0}, inplace=True)
    # if you want to convert everything to float
    df = df.astype(float)
    return df
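Hypothetical usage, assuming a file 'sample.csv' with some dotted column names and "True"/"False" strings mixed into otherwise numeric columns:
df = ImportCustomScript('sample.csv')
print(df.dtypes)  # every remaining column should now be float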

pandas.read_excel with identical column names in excel

When I import an Excel table with pandas.read_excel there is a problem (or a feature :-) ) with identical column names. For example, the Excel file has two columns named "dummy"; after the import into a dataframe the second column is named "dummy.1".
Is there a way to import without the renaming?
Now, I don't see why you would want this. However, as I could think of a workaround, I might as well post it.
import pandas as pd
cols = pd.read_excel('text.xlsx', header=None, nrows=1).values[0]  # read first row
df = pd.read_excel('text.xlsx', header=None, skiprows=1) # skip 1 row
df.columns = cols
print(df)
Returns:
   col1  col1
0     1     1
1     2     2
2     3     3
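An alternative sketch (not strictly what was asked, but sometimes enough): read normally and strip pandas' '.N' suffix back off afterwards. This assumes none of your real column names already end in '.<number>':
import pandas as pd
df = pd.read_excel('text.xlsx')
# turn 'dummy.1', 'dummy.2', ... back into 'dummy'
df.columns = [c.rsplit('.', 1)[0]
              if isinstance(c, str) and '.' in c and c.rsplit('.', 1)[1].isdigit()
              else c
              for c in df.columns]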

How to compare two CSV files and get the difference?

I have two CSV files,
a1.csv
city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
Aguila,Arizona,http://www.co.apache.az.us/planning-and-zoning-division/zoning-ordinances/
a2.csv
city,state,link
Aguila,Arizona,http://www.co.apache.az.us
I want to get the difference.
Here is my attempt:
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
mask = a.isin(b.to_dict(orient='list'))
# Reverse the mask and remove null rows.
# Upside is that index of original rows that
# are now gone are preserved (see result).
c = a[~mask].dropna()
print(c)
Expected Output:
city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
But I am getting an empty DataFrame instead:
Empty DataFrame
Columns: [city, state, link]
Index: []
I want to check based on the first two rows; if they are the same, remove the row.
You can use pandas to read in the two files, concatenate them, and remove all duplicate rows:
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
ab = pd.concat([a,b], axis=0)
ab = ab.drop_duplicates(keep=False)  # keep=False drops every row that has a duplicate
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
First, concatenate the DataFrames, then drop the duplicates while still keeping the first one. Then reset the index to keep it consistent.
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
c = pd.concat([a,b], axis=0)
c.drop_duplicates(keep='first', inplace=True) # Set keep to False if you don't want any
# of the duplicates at all
c.reset_index(drop=True, inplace=True)
print(c)
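If you only want the comparison to consider certain columns (say the first two, city and state), drop_duplicates also accepts a subset parameter; a small sketch under that assumption:
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
c = pd.concat([a, b], axis=0)
# keep=False removes every row whose (city, state) pair occurs more than once
c = c.drop_duplicates(subset=['city', 'state'], keep=False).reset_index(drop=True)
print(c)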

Pandas append data frames, add a field, and then flood the field with a default value?

I have several data frames that all contain the same column names. I want to append them into a master data frame. I also want to create a column that denotes the original data frame and then flood it with that data frame's name. I have some code that works.
df_combine = df_breakfast.copy()
df_combine['X_ORIG_DF'] = 'Breakfast'
df_combine = df_combine.append(df_lunch, ignore_index=True)
df_combine['X_ORIG_DF'] = df_combine['X_ORIG_DF'].fillna('Lunch')
# Rinse and repeat
However, it seems inelegant. I was hoping someone could point me to a more elegant solution. Thank you in advance for your time!
Note: Edited to reflect comment!
I would definitely consider restructuring your data so that the names can be accessed neatly rather than being stored as variable names (if they must be separate to begin with).
For example a dictionary:
d = {'breakfast': df_breakfast, 'lunch': df_lunch}
Create a function to give each DataFrame a new column:
def add_col(df, col_name, col_entry):
    df = df.copy()  # so as not to change df_lunch etc.
    df[col_name] = col_entry
    return df
and combine the list of DataFrames, each with the appended column ('X_ORIG_DF'):
In [3]: df_combine = pd.DataFrame().append(list(add_col(v, 'X_ORIG_DF', k)
                                                for k, v in d.items()))
Out[3]:
   0  1  X_ORIG_DF
0  1  2      lunch
1  3  4      lunch
0  1  2  breakfast
1  3  4  breakfast
In this example: df_lunch = df_breakfast = pd.DataFrame([[1, 2], [3, 4]]).
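Note that DataFrame.append was removed in pandas 2.0; a sketch of the same idea with pd.concat, where the dict keys become an index level that reset_index then turns into the 'X_ORIG_DF' column:
import pandas as pd
d = {'breakfast': df_breakfast, 'lunch': df_lunch}
df_combine = (pd.concat(d, names=['X_ORIG_DF'])
                .reset_index(level='X_ORIG_DF')
                .reset_index(drop=True))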
I've encountered a similar problem as you when trying to combine multiple files together for the purpose of analysis in a master dataframe. Here is one method for creating that master dataframe by loading each dataframe independently, giving them each an identifier in a column called 'ID' and combining them. If your data is a list of files in a directory called datadir I would do the following:
import os
import pandas as pd

datadir = 'datadir'  # directory containing the data files
data_list = os.listdir(datadir)
df_dict = {}
for data_file in data_list:
    df = pd.read_table(os.path.join(datadir, data_file))
    # add an ID column based on the file name.
    # you could use some other naming scheme of course
    df['ID'] = data_file
    df_dict[data_file] = df
# the concat function is great for combining lots of dfs.
# it takes a list of dfs as an argument.
combined_df_with_named_column = pd.concat(df_dict.values())
