removing columns in a loop from different size dataframes [duplicate] - python

I am reading from an Excel sheet and I want to read certain columns: column 0 because it is the row-index, and columns 22:37. Now here is what I do:
import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = 37)
df = pd.concat([df[df.columns[0]], df[df.columns[22:]]], axis=1)
But I would hope there is a better way to do that! I know I can do it with parse_cols=[0, 22,..,37], but for large datasets this doesn't make sense.
I also did this:
s = pd.Series(0)
s[1] = 22
for i in range(2, 14):
    s[i] = s[i-1] + 1
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = s)
But it reads the first 15 columns, which is the length of s.

You can use column indices (letters) like this:
import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols="A,C:AA")
print(df)
Corresponding documentation:
usecols : int, str, list-like, or callable, default None
If None, then parse all columns.
If str, then indicates comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.
If list of int, then indicates list of column numbers to be parsed.
If list of string, then indicates list of column names to be parsed.
New in version 0.24.0.
If callable, then evaluate each column name against it and parse the column if the callable returns True.
Returns a subset of the columns according to behavior above.
New in version 0.24.0.
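For the original goal (column 0 plus columns 22:37), a list of integer positions built with range should also work; a minimal sketch, assuming the same file as above:
import pandas as pd
file_loc = "path.xlsx"
# column 0 plus columns 22..37 by integer position (range is exclusive at the top)
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols=[0] + list(range(22, 38)))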

parse_cols is deprecated; use usecols instead. That is:
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols = "A,C:AA")

"usecols" should help, use range of columns (as per excel worksheet, A,B...etc.)
below are the examples
1. Selected columns
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A,C,F")
2. Range of columns and a selected column
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A:F,H")
3. Multiple ranges
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A:F,H,J:N")
4. Range of columns
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A:N")
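5. Callable (a hedged sketch: select columns whose header name matches a condition; the "Density" substring here is made up)
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols=lambda name: "Density" in str(name))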

If you know the names of the columns and do not want to use A,B,D or 0,4,7, this works:
df = pd.read_excel(url)[['name of column','name of column','name of column','name of column','name of column']]
where 'name of column' is a wanted column name. Names are case and whitespace sensitive.

Read any column's data in Excel:
import pandas as pd
name_of_file = "test.xlsx"
data = pd.read_excel(name_of_file)
required_column_name = "Post test Number"
print(data[required_column_name])
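If only that one column is needed, passing its name through usecols (as a list of strings, per the documentation quoted earlier) avoids loading the rest of the sheet; a small sketch:
data = pd.read_excel(name_of_file, usecols=["Post test Number"])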

Unfortunately these methods still seem to read and convert the headers before returning the subselection. I have an Excel sheet with duplicate header names because the sheet contains several similar tables. I want to read those tables individually, so I would want to apply usecols. However, this still adds suffixes to the duplicate column names.
To reproduce:
create an Excel sheet with headers named Header1, Header2, Header1, Header2 under columns A, B, C, D
pd.read_excel(filename, usecols='C:D')
df.columns will return ['Header1.1', 'Header2.1']
Is there a way to circumvent this, aside from splitting and joining the resulting headers? Especially when it is unknown whether there are duplicate columns, it is tricky to rename them, as splitting on '.' may corrupt a non-duplicate header.
Edit: additionally, the length (in indices) of a DataFrame based on a subset of columns will be determined by the length of the full file. So if column A has 10 rows and column B only has 5, a DataFrame generated by usecols='B' will have 10 rows, of which 5 are filled with NaNs.
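One possible workaround, sketched here under the assumption that you already know the header text: skip the header row entirely so pandas never sees, and never mangles, the duplicates, then assign the names yourself:
import pandas as pd
# read columns C:D as data only; skiprows=1 jumps over the header row
df = pd.read_excel(filename, usecols='C:D', header=None, skiprows=1)
df.columns = ['Header1', 'Header2']  # assign the duplicate names you expect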

Related

Creating a dataframe from several .txt files - each file being a row with 25 values

So, I have 7200 txt files, each with 25 lines. I would like to create a dataframe from them, with 7200 rows and 25 columns -- each line of the .txt file would be a value in a column.
For that, first I have created a list column_names with length 25, and tested importing one single .txt file.
However, when I try this:
pd.read_csv('Data/fake-meta-information/1-meta.txt', delim_whitespace=True, names=column_names)
I get a 25x25 dataframe, with values only in the first column. How do I read the files so that the txt lines end up as values across the columns, instead of everything going into the first column and creating 25 rows?
My next step would be creating a for loop to append each text file as a new row.
Probably something like this:
dir1 = *folder_path*
list = os.listdir(dir1)
number_files = len(list)
for i in range(number_files):
    title = list[i]
    df_temp = pd.read_csv(dir1 + title, delim_whitespace=True, names=column_names)
    df = df.append(df_temp, ignore_index=True)
I hope I have been clear. Thank you all in advance!
read_csv generates a row per line in the source file, but you want them to be columns. You could read the rows and pivot to columns, but since these files have a single value per line, you can simply read each one with numpy and use the resulting array as a row in a dataframe.
import numpy as np
import pandas as pd
from pathlib import Path
dir1 = Path(".")
df = pd.DataFrame([np.loadtxt(filename) for filename in dir1.glob("*.txt")])
print(df)
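If you want the column_names list from the question applied, it can be set afterwards:
df.columns = column_names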
tdelaney's answer is probably "better" than mine, but if you want to keep your code stylistically closer to what you are currently doing, the following is another option.
You are getting your current output (25x25 with data in the first column only) because your read data is 25x1, but you are forcing the dataframe to have 25 columns with the names=column_names parameter.
To solve, just wait until the end to apply the column names:
Get a 25x1 df (drop the names param and pass header=None so the first value is not consumed as a header):
df_temp = pd.read_csv(dir1 + title, delim_whitespace=True, header=None)
Concatenate the 25x1 dfs column-wise, forming a 25x7200 df: df = pd.concat([df, df_temp], axis=1)
Transpose the df, forming the final 7200x25 df: df = df.T
Add column names: df.columns = column_names
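Putting those steps together, a minimal sketch (assuming the asker's dir1 and column_names, and using pd.concat since df.append was removed in pandas 2.0):
import os
import pandas as pd
frames = []
for title in os.listdir(dir1):
    # each file holds 25 lines with one value per line -> a 25x1 frame
    frames.append(pd.read_csv(dir1 + title, delim_whitespace=True, header=None))
df = pd.concat(frames, axis=1, ignore_index=True).T  # 7200 rows x 25 columns
df.columns = column_names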

Read multiple .csv files and extract (in new .csv files) all rows corresponding to non-empty cells across a specific column

I have multiple .csv files (about 250). Each of them has exactly the same columns. All of them have many empty cells across many of the columns. I am interested in extracting only the rows corresponding to non-empty cells of a specific column (named 20201-2.0). I believe it will work better with pandas.
So far, I have done the following step, which would work if continued:
import pandas as pd
import glob
path = './'
column = ['20201-2.0']
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename,?,?)
    li.append(df)
Is there a way I could extract only the rows corresponding to non-empty cells of column '20201-2.0' within df?
Or some other way?
George
df = pd.read_csv('myfile.csv').dropna(subset=['20201-2.0'])
If the cells are truly "empty" versus holding a space string (" ") or a zero, then they will contain a "NaN" (a true null). You should be able to get them with...
df = li[li['20201-2.0'].notnull()]
A more complete example...
import pandas as pd
import numpy as np
# Create the dataframe "li" with a bunch of random numbers
li = pd.DataFrame(np.random.randn(5,4), columns= ['Col1', 'Col2','20201-2.0', 'Col4'])
# Make one specific cell under the "20201-2.0" column a null (NaN) value
li.loc[2, '20201-2.0'] = np.nan
print(li)  # See what you're working with
# Select for all rows, in all columns where the column "20201-2.0" is not a null
# This will return a full dataframe, with all the rows and columns - excluding any row(s) where the cell below "20201-2.0" was null
df = li[li['20201-2.0'].notnull()]
print(df)
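Applied to the original loop over all 250 files, a sketch (the _filtered output names are made up):
import glob
import pandas as pd
path = './'
for filename in glob.glob(path + "/*.csv"):
    df = pd.read_csv(filename)
    # keep only rows where column '20201-2.0' is non-empty
    df = df[df['20201-2.0'].notnull()]
    df.to_csv(filename.replace('.csv', '_filtered.csv'), index=False)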

Removing rows of duplicate headers or strings in the same columns, and blank lines, in pandas in Python

I have a sample data file (Data_sample_truncated.txt) which I truncated from a larger dataset. It has 3 fields - "Index", "Time" and "RxIn.Density[x, ::]", where x is an integer that can vary over any range; in this data it is 0-15. The combination of the 3 column fields is unique. For different "Index" values the "Time" and "RxIn.Density[x, ::]" can be the same or different. For each new "Index" value the data has a blank line and almost identical column headers, except for "RxIn.Density[x, ::]" where x increases when a new "Index" value is reached. This is the format the data comes in when I export it from ADS (circuit simulation software).
Now I want to format the data so that all the data are merged under 3 unique column fields - "Index", "Time" and "RxIn.Density". You can see I want to remove the [x, ::] strings from the 3rd column header in the new dataframe. Here is the sample final data file that I want after formatting (Data-format_I_want_after_formatting.txt). So I want the following -
The blank lines (or rows) to be removed
All the other header lines to be removed keeping the top header only and changing the 3rd column header to "RxIn.Density"
Keeping all the data merged under the unique column fields - "Index", "Time" and "RxIn.Density", even if the data values are duplicate.
My Python code is below:
import pandas as pd
# create a DataFrame from the txt with columns index, time and v
df = pd.read_csv('Data_sample_truncated.txt', sep=r"\s+", names=['index','time','v'])
# boolean mask identifying the repeated header rows
m = df['v'].str.contains('RxIn')
# new column: keep the header value where present and forward-fill the NaNs
df['g'] = df['v'].where(m).ffill()
#get original ordering for new columns
#cols = df['g'].unique()
#remove rows with same values in v and g columns
#df = df[df['v'] != df['g']]
df = df.drop_duplicates(subset=['index', 'time'], keep=False)
df.to_csv('target.txt', index=False, sep='\t')
The generated target.txt file is not what I wanted. You can check it here. Can anyone help with what is wrong in my code and how to fix it so that I get my intended formatting?
I am using Spyder 3.2.6 (Anaconda) where python 3.6.4 64-bit is embedded.
You can just filter out the rows that you do not want:
import pandas as pd
df = pd.read_csv('Data_sample_truncated.txt', sep=r"\s+")
df.columns = ["index","time","RxIn.Density","1"]
del df["1"]
df = df[~df["RxIn.Density"].str.contains("Rx")].reset_index(drop=True)
df.to_csv('target.txt', index=False, sep='\t')
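Note that because the header lines were initially read as data, the surviving columns will hold strings; converting them back to numbers may be needed, e.g. (a sketch):
df = df.apply(pd.to_numeric)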
Try this:
df = pd.read_csv('Data_sample_truncated.txt', sep=r'\s+', names=['index', 'time', 'RxIn.Density', 'mask'], header=None)
df = df[df['mask'].isna()].drop(['mask'], axis=1)
df.to_csv('target.txt', index=False, sep='\t')

Concat pandas dataframes without following a certain sequence

I have data files which are converted to pandas dataframes; some share column names while others share the time-series index, and I wish to combine them all into one dataframe, matching on both column and index whenever possible. Since there is no sequence in the naming, they appear in random order for concatenation. If two dataframes with different columns are concatenated along axis=1, it works well, but if the resulting dataframe is then combined with a new df whose column name matches one of the earlier merged dataframes, the concat fails. For example, with these data files:
import pandas as pd
df1 = pd.read_csv('0.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df2 = pd.read_csv('1.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df3 = pd.read_csv('2.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
data1 = pd.DataFrame()
file_list = [df1, df2, df3] # fails
# file_list = [df2, df3,df1] # works
for fn in file_list:
    if data1.empty or fn.columns[1] in data1.columns:
        data1 = pd.concat([data1, fn])
    else:
        data1 = pd.concat([data1, fn], axis=1)
I get ValueError: Plan shapes are not aligned when I try to do that. In my case there is no way to first load all the DataFrames and check their column names. If I could, I would combine all dfs with the same column names first and only then concat the resulting dataframes with different column names along axis=1, which I know always works, as shown below. However, a solution which requires preloading all the DataFrames and rearranging the sequence of concatenation is not possible in my case (it was only done for the working example above). I need the flexibility that, in whichever sequence the information comes, it can be concatenated with the larger dataframe data1. Please let me know if you have a suitable approach.
If you go through the loop step by step, you can see that in the first iteration it goes into the if, so data1 is equal to df1. In the second iteration it goes into the else, since data1 is not empty and 'Temperature product barrel ValueY' is not in data1.columns.
After the else, data1 has some duplicated column names, and in every row of those duplicated columns one of the two is NaN while the other holds a float. This is the reason why pd.concat() fails.
You can aggregate the duplicate columns before you try to concatenate, to get rid of them:
import numpy as np

for fn in file_list:
    if data1.empty or fn.columns[1] in data1.columns:
        # new: collapse duplicate column names before the row-wise concat
        data1 = data1.groupby(data1.columns, axis=1).agg(np.nansum)
        data1 = pd.concat([data1, fn])
    else:
        data1 = pd.concat([data1, fn], axis=1)
After that, you would get
data1.shape
(30, 23)
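On newer pandas versions, where groupby(..., axis=1) is deprecated, the same column-wise aggregation can be written by transposing first (a sketch; .sum() skips NaNs like np.nansum does):
data1 = data1.T.groupby(level=0).sum().T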

converting column names to integer with read_csv

I have constructed a matrix with integer values for columns and index. The matrix is actually hierarchical for each month. My problem is that indexing and selecting data no longer works as before once I write the data to csv and then load it back as a pandas dataframe.
Selecting data before writing and reading data to file:
matrix.ix[1][4][3] would, for example, give 123.
In words: select month January and get me the (travel) flow from origin 4 to destination 3.
After writing the data to csv and reading it back into pandas, the original referencing fails, but it works if I convert the column index to string:
matrix.ix[1]['4'][3]
... the column names have automatically been transformed from integer into string. But I would prefer the original indexing.
Any suggestions?
My current quick fix for handling the data after loading from csv is:
#Writing df to file
mulitindex_df_Travel_monthly.to_csv(r'result/Final_monthly_FlightData_countrylevel_v4.csv')
#Loading df from csv
test_matrix = pd.read_csv(filepath_inputdata + '/Final_monthly_FlightData_countrylevel_v4.csv',
                          index_col=[0, 1])
test_matrix.rename(columns = int, inplace = True) #Thx, #ayhan
CSV FILE:
https://www.dropbox.com/s/4u2opzh65zwcn81/travel_matrix_SO.csv?dl=0
I used something like this:
df = df.rename(columns={str(c): c for c in columns})
where df is the pandas dataframe and columns are the columns to change.
You could also do
df.columns = df.columns.astype(int)
or
df.columns = df.columns.map(int)
Related: What is the difference between .map(str) and .astype(str) in a dataframe?
