how to slice and create multiple pandas dataframes from a single dataframe - python

I am reading an Excel file using pandas. I want to create multiple data frames from the original data frame, and each data frame's name should be the row 1 heading. Also, how do I skip the one empty column between each transaction?
Expected result:
transaction_1:
name id available capacity completed all
transaction_2:
name id available capacity completed all
transaction_3:
name id available capacity completed all
What I tried:
import pandas as pd
import pprint as pp
pd.options.display.width = 0
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999
df = pd.read_excel(r'capacity.xlsx', sheet_name='Sprint Details', header=0)
df1 = df.iloc[:, 0:3]
print(df1)

You can try this (works with pd.__version__ == 1.1.1):
df = (
    pd.read_excel(
        "capacity.xlsx", sheet_name="Sprint Details", header=[0, 1], index_col=[0, 1]
    )
    .dropna(axis=1, how="all")
    .rename_axis(index=["name", "id"], columns=[None, None])
)
transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()
Essentially, we need to read the sheet in as a dataframe with a MultiIndex. The first 2 rows are our column names (header=[0, 1]), while the first 2 columns are the index that will be used for each "subtable" (index_col=[0, 1]).
Because there are gaps between the tables, we will have columns that are entirely NaN, so we drop those with .dropna(axis=1, how="all").
Because pandas does not expect the index names and column headers to be in the same row, it will incorrectly parse your index column names ["name", "id"] as the name of the second level of the column index. To remedy this, we can manually assign the correct index names while also removing the column index names via rename_axis(index=["name", "id"], columns=[None, None]).
Now that we have a nicely formatted table with a MultiIndex column, we can simply slice out each table and call .reset_index() on each to ensure that "name" and "id" appear as columns in each table.
Edit: Seems we have a parsing difference between our versions of pandas.
Option 1.
If you can directly modify the Excel sheet, include another row to better separate the column headers from the index names. This will provide the most robust results.
The following code works:
df = (
    pd.read_excel(
        "capacity.xlsx", sheet_name="Sprint Details", header=[0, 1], index_col=[0, 1]
    )
    .dropna(axis=1, how="all")
)
transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()
Option 2
If you cannot modify the Excel file, we'll unfortunately need a more roundabout method.
df = pd.read_excel("capacity.xlsx", header=[0,1]).dropna(axis=1, how="all")
index = pd.MultiIndex.from_frame(df.iloc[:, :2].droplevel(0, axis=1))
df = df.iloc[:, 2:].set_axis(index)
transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()
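With either option, if the number of transactions varies, you don't have to hardcode each slice; a minimal sketch (the dict name transactions is my own, and it assumes the top column level holds the transaction names as built above):
# slice out every sub-table in one pass, keyed by its top-level column name
transactions = {
    name: df[name].reset_index()
    for name in df.columns.get_level_values(0).unique()
}
print(transactions["transaction_1"])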

Related

Dropping index in DataFrame for CSV file

Working with a CSV file in PyCharm, I want to delete the automatically-generated index column. When I print it, however, the answer I get in the terminal is "None". All the answers by other users indicate that the reset_index method should work.
If I just write "df = df.reset_index(drop=True)" it does not delete the column, either.
import pandas as pd
df = pd.read_csv("music.csv")
df['id'] = df.index + 1
cols = list(df.columns.values)
df = df[[cols[-1]]+cols[:3]]
df = df.reset_index(drop=True, inplace=True)
print(df)
I agree with @It_is_Chris. Also, this does not work because reset_index with inplace=True returns None:
df = df.reset_index(drop=True, inplace=True)
It should be either:
df.reset_index(drop=True, inplace=True)
or
df = df.reset_index(drop=True)
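A quick illustration with toy data (my own example) of why the assignment breaks things:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
result = df.reset_index(drop=True, inplace=True)
print(result)  # None -- every inplace=True operation returns None
print(df)      # df itself was modified in place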
Since you said you're trying to "delete the automatically-generated index column", I can think of two solutions!
First solution:
Assign one of your dataset's own columns as the index. Let's say your dataset has already been indexed/numbered; then you could do something like this:
# assuming the first column in the dataset is your index column (at column position zero)
df = pd.read_csv("yourfile.csv", index_col=0)
# you won't see the automatically-generated index column anymore
df.head()
Second solution:
You could delete it in the final csv:
#To export your df to a csv without the automatically-generated index column
df.to_csv("yourfile.csv", index=False)

Find the difference between data frames based on specific columns and output the entire record

I want to compare 2 CSV files (A and B) and find the rows that are present in B but not in A, based only on specific columns.
I found a few answers for this, but they still don't give the result I expect.
Answer 1 :
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work. It works for a single column but not for multiple columns.
Answer 2 :
df = pd.concat([old, new]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
final = df.reindex(idx)
This takes specific columns as input and also outputs only those specific columns. I want to print the whole record, not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)
col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()
new = new[~new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
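As an alternative, merge has an indicator flag that makes this kind of anti-join more direct; a sketch, assuming A and B are your two DataFrames and columns is the list of key columns as above:
# indicator=True adds a "_merge" column saying where each row came from
merged = B.merge(A[columns].drop_duplicates(), on=columns,
                 how="left", indicator=True)
# keep the full B records whose key combination never appears in A
only_in_B = merged[merged["_merge"] == "left_only"].drop(columns="_merge")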

removing columns in a loop from different size dataframes [duplicate]

I am reading from an Excel sheet and I want to read certain columns: column 0 because it is the row-index, and columns 22:37. Now here is what I do:
import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = 37)
df= pd.concat([df[df.columns[0]], df[df.columns[22:]]], axis=1)
But I would hope there is better way to do that! I know if I do parse_cols=[0, 22,..,37] I can do it, but for large datasets this doesn't make sense.
I also did this:
s = pd.Series(0)
s[1] = 22
for i in range(2, 14):
    s[i] = s[i-1] + 1
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols=s)
But it reads the first 15 columns which is the length of s.
You can use Excel column letters like this:
import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols="A,C:AA")
print(df)
Corresponding documentation:
usecols : int, str, list-like, or callable, default None
If None, then parse all columns.
If str, then indicates comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.
If list of int, then indicates list of column numbers to be parsed.
If list of string, then indicates list of column names to be parsed.
New in version 0.24.0.
If callable, then evaluate each column name against it and parse the column if the callable returns True.
Returns a subset of the columns according to behavior above.
New in version 0.24.0.
parse_cols is deprecated, use usecols instead
that is:
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols = "A,C:AA")
"usecols" should help, use range of columns (as per excel worksheet, A,B...etc.)
below are the examples
1. Selected Columns
df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A,C,F")
2. Range of Columns and selected column
df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:F,H")
3. Multiple Ranges
df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:F,H,J:N")
4. Range of columns
df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:N")
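The documentation quoted above also allows a callable; a small sketch of that form (the file path and the predicate are illustrative):
import pandas as pd

file_location = "path.xlsx"  # placeholder, as in the examples above
# parse only the columns whose header contains "total", case-insensitively
df = pd.read_excel(file_location, sheet_name='Sheet1',
                   usecols=lambda name: 'total' in str(name).lower())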
If you know the names of the columns and do not want to use A,B,D or 0,4,7, this actually works:
df = pd.read_excel(url)[['name of column','name of column','name of column','name of column','name of column']]
where "name of column" = columns wanted. Case and whitespace sensitive
Read any column's data in excel
import pandas as pd
name_of_file = "test.xlsx"
data = pd.read_excel(name_of_file)
required_column_name = "Post test Number"
print(data[required_column_name])
Unfortunately these methods still seem to read and convert the headers before returning the subselection. I have an Excel sheet with duplicate header names because the sheet contains several similar tables. I want to read those tables individually, so I would want to apply usecols. However, this still adds suffixes to the duplicate column names.
To reproduce:
create an Excel sheet with headers named Header1, Header2, Header1, Header2 under columns A, B, C, D
df = pd.read_excel(filename, usecols='C:D')
df.columns will return ['Header1.1', 'Header2.1']
Is there a way to circumvent this, aside from splitting and joining the resulting headers? Especially when it is unknown whether there are duplicate columns, it is tricky to rename them, as splitting on '.' may corrupt a non-duplicate header.
Edit: additionally, the length (in indices) of a DataFrame based on a subset of columns is determined by the length of the full file. So if column A has 10 rows and column B only has 5, a DataFrame generated by usecols='B' will have 10 rows, of which 5 are filled with NaNs.
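One possible workaround, sketched under the assumption that you know the target names and can skip the header row entirely (tables.xlsx and the column names are placeholders):
import pandas as pd

# skip the header row so pandas never sees the duplicates,
# then assign the column names yourself
df = pd.read_excel("tables.xlsx", usecols="C:D", header=None, skiprows=1)
df.columns = ["Header1", "Header2"]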

Drop an empty column of the dataframe does not work

My question has been asked multiple times and I implemented the provided answers, but none of them worked. I have a dataframe containing an object column in which all of the cells are empty strings. I have been trying to drop it by using each of the following methods separately:
data.dropna()
data.dropna(axis=1, inplace=True)
data.dropna(axis='columns', how='all', inplace=True)
data.mask(data.astype(bool)).dropna(axis=1, how='all')
data.dropna(subset=['columnName'], inplace=True)
filter = data['columnName'] != ""
data = data[filter]
I also tried to replace the empty cells with NaN by using:
data['columnName'].replace('', np.nan, inplace=True)
and then drop the column, but strangely nothing was even changed to NaN in the corresponding column. In the above lines of code, wherever it was required, I assigned the result of the methods back to data, but none of them worked. I wonder what to use instead that works?
This is a sample data:
BUILDING CATEGORY   MEANS OF ACCESS   ADDRESS   PRICE
rental                                UK        £700000
commercial                            UK        £5000000
I intend to drop MEANS OF ACCESS.
UPDATE
The code snippet is as follows:
# Remove the initial 2 rows
data = pd.read_csv(file, skiprows=2, low_memory=False)
# Remove the irrelevant columns
data = data.drop(['REGION', 'NUMBER'], axis=1)
# Remove '£' sign
data['PRICE'] = [x[1:] for x in data['PRICE']]
columns = ['WHOLE UNITS', 'AREA', 'PRICE']
# Remove comma
data[columns] = data[columns].apply(lambda x: x.str.replace(',', ''))
# Convert to numeric
data[columns] = data[columns].apply(pd.to_numeric)
# Remove duplicate rows
data.drop_duplicates(inplace=True)
print((data['MEANS OF ACCESS'] == "").sum()) #returns 0 but it shouldn't
If you want to drop the column 'column_name', then simply use
df = df.drop(labels=['column_name'], axis=1)
If you want to drop all columns which contain only empty strings, then use
df = df.replace('', pd.NA).dropna(axis=1, how='all')
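If replace('', np.nan) changed nothing, the cells may contain whitespace rather than truly empty strings; a sketch that covers that case with a regex replacement:
import numpy as np

# treat empty and whitespace-only cells as missing, then drop all-NaN columns
data = data.replace(r'^\s*$', np.nan, regex=True)
data = data.dropna(axis=1, how='all')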

Removing rows of duplicate headers or strings same columns and blank lines in pandas in python

I have a sample data file (Data_sample_truncated.txt) which I truncated from a bigger dataset. It has 3 fields - "Index", "Time" and "RxIn.Density[**x**, ::]", where x is an integer that can vary over any range; in this data it is 0-15. The combination of the 3 column fields is unique. For different "Index" values the "Time" and "RxIn.Density[**x**, ::]" can be the same or different. At each new "Index" value the data has a blank line and an almost identical header row, except that the x in "RxIn.Density[**x**, ::]" increases. The data I export from ADS (circuit simulation software) comes in this format.
Now I want to format the data so that all the data are merged under 3 unique column fields - "Index", "Time" and "RxIn.Density". As you can see, I want to remove the strings [**x**, ::] from the 3rd column header in the new dataframe. Here is the sample final data file that I want after formatting (Data-format_I_want_after_formatting.txt). So I want the following -
The blank lines (or rows) to be removed
All the other header lines to be removed keeping the top header only and changing the 3rd column header to "RxIn.Density"
Keeping all the data merged under the unique column fields - "Index", "Time" and "RxIn.Density", even if the data values are duplicate.
My Python code is below:
import pandas as pd
#create DataFrame from csv with columns f and v
df = pd.read_csv('Data_sample_truncated.txt', sep=r"\s+", names=['index','time','v'])
#boolean mask for identify columns of new df
m = df['v'].str.contains('RxIn')
#new column by replace NaNs by forward filling
df['g'] = df['v'].where(m).ffill()
#get original ordering for new columns
#cols = df['g'].unique()
#remove rows with same values in v and g columns
#df = df[df['v'] != df['g']]
df = df.drop_duplicates(subset=['index', 'time'], keep=False)
df.to_csv('target.txt', index=False, sep='\t')
The generated target.txt file is not what I wanted. You can check it here. Can anyone help with what is wrong with my code and how to fix it so that I get my intended formatting?
I am using Spyder 3.2.6 (Anaconda) where python 3.6.4 64-bit is embedded.
You can just filter out the rows that you do not want (check this):
import pandas as pd
df = pd.read_csv('Data_sample_truncated.txt', sep="\s+")
df.columns = ["index","time","RxIn.Density","1"]
del df["1"]
df = df[~df["RxIn.Density"].str.contains("Rx")].reset_index(drop=True)
df.to_csv('target.txt', index=False, sep='\t')
Try this:
df = pd.read_csv('Data_sample_truncated.txt', sep=r'\s+', names=['index', 'time', 'RxIn.Density', 'mask'], header=None)
# data rows only fill 3 of the 4 named columns, so 'mask' is NaN for them,
# while the repeated header rows fill all 4
df = df[df['mask'].isna()].drop(['mask'], axis=1)
df.to_csv('target.txt', index=False, sep='\t')
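Either way, the surviving values are read as strings; a small follow-up sketch (my own addition) to coerce them back to numbers before writing:
# convert the remaining columns to numeric; non-numeric leftovers become NaN
for col in ['index', 'time', 'RxIn.Density']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
df = df.dropna(how='all')  # drop any fully blank rows that survived
df.to_csv('target.txt', index=False, sep='\t')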
