Match the dataframe with the list of column names - python

I have two files. The first one contains the dataframe, without column names:
2008-03-13 15 56 0 25
2008-03-14 10 32 27 45
2008-03-16 40 8 54 35
2008-03-18 40 8 63 30
2008-03-19 45 32 81 25
and another file that contains the list of column names (except for the datetime column) in the following form:
output of file.read()
List(Group, Age, Income, Location)
In my real data there are many more columns and column names. The columns of the dataframe are ordered like the elements of the list, i.e. the first column corresponds to Group, the third one to Income, the last one to Location, etc.
So my goal is to name the columns of my dataframe with the elements contained in this file.
This operation will not work for obvious reasons (the datetime column is not contained in the list, and the list is not formatted as a Python list):
with open(file2) as f:
    list_of_columns = f.read()
df = pd.read_csv(file1, sep='\t', names=list_of_columns)
I can already imagine the preprocessing work of removing the word List and the parentheses from the output of file2 and adding the datetime column at the head of the list, but if you have a more elegant and quick solution, let me know!

You can do it this way:
import re
import pandas as pd

fn = r'D:\temp\.data\36972593_header.csv'
with open(fn) as f:
    data = f.read()
# it will also tolerate the case where `List(...)` is not on the first line
cols = ['Date'] + re.sub(r'.*List\((.*)\).*', r'\1', data, flags=re.S|re.I|re.M).replace(' ', '').split(',')

fn = r'D:\temp\.data\36972593_data.csv'
# this will also parse the `Date` column as `datetime`
df = pd.read_csv(fn, sep=r'\s+', names=cols, parse_dates=[0])
Result:
In [82]: df
Out[82]:
Date Group Age Income Location
0 2008-03-13 15 56 0 25
1 2008-03-14 10 32 27 45
2 2008-03-16 40 8 54 35
3 2008-03-18 40 8 63 30
4 2008-03-19 45 32 81 25
In [83]: df.dtypes
Out[83]:
Date datetime64[ns]
Group int64
Age int64
Income int64
Location int64
dtype: object

If the list of column names comes as a string in exactly this format, you could do:
with open(file2) as f:
    list_of_columns = f.read()
list_of_columns = ['date'] + list_of_columns[5:-1].split(',')
list_of_columns = [l.strip() for l in list_of_columns]  # remove leading/trailing whitespace
df = pd.read_csv(file1, sep='\t', names=list_of_columns)

Related

How to combine the first 2 columns in pandas/python with n/a values

I have a question about combining the first 2 columns in pandas/python when there are n/a values.
Long story short: I need to read an Excel file and make changes to it. I cannot change anything in Excel itself, so every change has to be done in Python.
The Excel input and the expected output are shown below.
I manage to read the file in, but when I try to combine the first 2 columns I run into problems: the first column in Excel uses merged cells, so once it is read in only one row has a value and the rest of the rows are all N/A, such as below:
Year number 2016
Month Jan
Month 2016-01
Grade 1 100
NaN 2 99
NaN 3 98
NaN 4 96
NaN 5 92
NaN Total 485
Is there any function that can easily help me to combine the first two columns and make it as below:
Year 2016
Month Jan
Month 2016-01
Grade 1 100
Grade 2 99
Grade 3 98
Grade 4 96
Grade 5 92
Grade Total 485
Any help will be really appreciated.
I searched and googled the keywords for a long time but did not find any answer that fits my situation.
import pandas as pd
from io import StringIO

d = '''
Year,number,2016
Month,,Jan
Month,,2016-01
Grade,1, 100
NaN,2, 99
NaN,3, 98
NaN,4, 96
NaN,5, 92
NaN,Total,485
'''
df = pd.read_csv(StringIO(d))
df
df['Year'] = df.Year.fillna(method='ffill')  # forward-fill the merged-cell 'Year' values
df = df.fillna('')  # skip this step if your data from excel does not have nan in col 2.
df['Year'] = df.Year + ' ' + df.number.astype('str')  # combine the first two columns
df = df.drop('number', axis=1)
df

python sort a list of strings based on substrings using pandas

I have an Excel sheet with 4 columns: Filename, SNR, Dynamic Range, Level.
Filename                                                                       SNR  Dynamic Range  Level
1___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HPOF.xlsx     5    11             8
19___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS32_HPOF.xlsx   15   31             23
10___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS16_HPOF.xlsx   10   21             24
28___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS48_HPOF.xlsx   20   41             23
37___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HP4.xlsx     25   51             12
I need to reorder the table by the first column, the Xls filename, such that the FS<number> part of the filename (FS8, FS16, FS32, FS48) is in order from least to greatest, i.e.:
Filename                                                                       SNR  Dynamic Range  Level
1___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HPOF.xlsx     5    11             8
37___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HP4.xlsx     25   51             12
10___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS16_HPOF.xlsx   10   21             24
19___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS32_HPOF.xlsx   15   31             23
28___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS48_HPOF.xlsx   20   41             23
I don't want to change the actual Excel file. I was hoping to use pandas because I am doing some other manipulation later on.
I tried this:
df.sort_values(by='Xls Filename', key=lambda col: col.str.contains('_FS'),ascending=True)
but it didn't work.
Thank you in advance!
Extract the pattern, find the sort index using argsort and then sort with the sort index:
# extract the number to sort by into a Series
fs = df.Filename.str.extract(r'FS(\d+)_\w+\.xlsx$', expand=False)
# find the sort order with `argsort` and reorder the data frame with it
df.loc[fs.astype(int).argsort()]
# Filename ... Level
#0 1___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HPOF.xlsx ... 8
#4 37___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HP4.xlsx ... 12
#2 10___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS16_HPOF.xlsx ... 24
#1 19___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS32_HPOF.xlsx ... 23
#3 28___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS48_HPOF.xlsx ... 23
Here the regex FS(\d+)_\w+\.xlsx$ captures the digits that immediately follow FS and precede _\w+\.xlsx.
In case some filenames don't match the pattern, convert to float instead of int, because non-matches become NaN:
df.loc[fs.astype(float).values.argsort()]

Pandas - How to Extract Headers Which Are Contained Within Each Row?

I am a beginner with Pandas and I have a large dataset in an archaic format which I would like to wrangle into Pandas format. The data looks like this:
0 1 2 3 4 5 ...
0 ì 8=xx 9=00 35=8 49=YY 56=073 ...
1 8=xx 9=00 35=8 49=YY 56=073 34=10715 ...
2 8=xx 9=00 35=8 49=YY 56=073 34=10716 ...
...
Each cell contains the column header and the value separated by "=", with the header on the left and the value on the right. Hence the data should look like this:
8 9 35 49 56 34 ...
0 xx 00 8 YY 073 107 ...
1 xx 00 8 YY 073 107 ...
2 xx 00 8 YY 073 107 ...
...
Each row has a different number of columns and there may be some repetition per row; for example, 8=xx may occur multiple times in a row. I would like to create a new column (e.g. 8_x, 8_y, ...) each time this happens. I have tried to formulate a for/iterrows() loop to iterate through each row, but I am not sure how I can split the string and set the header in one go.
I've tried to look for a similar issue on the site but no success so far. Any help is much appreciated!
Edit: Adding in the code I used to parse the initial raw data into the format in the first table.
import pandas as pd
df = pd.read_csv('File.dat', sep='\n',nrows = 2, header=None, encoding = "ANSI")
df = df[0].str.split('<SPECIAL CHAR.>', expand=True)
As mentioned above in one of the comments on the original post, the 'right' way to deal with this would be to parse the data before it's in a dataframe. That being said, once the data is in a dataframe you can use the following code:
rows = []

def parse_row(row):
    d = {}
    for item in row[1]:
        if type(item) != str or "=" not in item:
            continue  # ignore this item
        [col_name, val] = item.split("=")
        if col_name in d:
            # the column already appears in this row: find a free suffix
            inx = 0
            while f"{col_name}_{inx}" in d:
                inx += 1
            col_name = f"{col_name}_{inx}"
            print(f"new colname is {col_name}")
        d[col_name] = val
    return d

for row in df.iterrows():
    rows.append(parse_row(row))

pd.DataFrame(rows)
I tested it with the following input:
0 1 2 3 4 5
0 ì 8=xx 9=00 35=8 49=YY 56=073
1 8=xx 9=00 35=8 49=YY 56=073 34=10715
2 8=xx 9=00 35=8 49=YY 8=zz 34=10716
This is the output:
8 9 35 49 56 34 8_0
0 xx 00 8 YY 073 NaN NaN
1 xx 00 8 YY 073 10715 NaN
2 xx 00 8 YY NaN 10716 zz
If the original .dat file is in a plain text format, as one of the comments says, it can easily be transformed into CSV format (a Python sketch of the same idea follows the steps):
Open the .dat file in your favorite text editor that supports regular expressions.
Copy the first line and remove all occurrences of '=[^,]+' to create the header with the column names.
From the 2nd line onward remove all occurrences of '[^,]+=' to preserve only the cell values.
Save the CSV file and open it in Python with pd.read_csv(...).
This way, every time you load the CSV, chances are pandas will guess the data format in each column correctly.
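For completeness, here is a minimal sketch of the same transformation done directly in Python rather than in a text editor. It assumes, like the steps above, that the raw file really is comma-separated key=value text; the file name raw.dat is a placeholder, not the asker's real file:

import re
import pandas as pd
from io import StringIO

# Assumption: plain text, comma-separated "key=value" cells, keys only in the first line's positions.
with open('raw.dat') as f:
    lines = f.read().splitlines()

header = re.sub(r'=[^,]+', '', lines[0])                    # keep only the keys, e.g. "8,9,35,..."
body = [re.sub(r'[^,]+=', '', line) for line in lines[1:]]  # keep only the values

df = pd.read_csv(StringIO('\n'.join([header] + body)))

Like the text-editor approach, this assumes every row has the same columns in the same order; ragged rows would need the dictionary-based parse_row approach shown earlier.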

How to Loop over Numeric Column in Pandas Dataframe and filter Values?

df:
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
3 Foxtrix Ammy thirty 2000
4 Hensaui giny 33 ten
5 menuia rony fifty 7000
6 lopex nick 23 Ninety
I want to loop over the numeric columns (Age, Salary) and check whether each value is numeric or not; if a string value is present in a numeric column, filter out that record and create a new data frame without those bad rows.
Output:
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
You could extend this approach to filter multiple columns on numeric data types:
import pandas as pd
from io import StringIO
data = """
Org_Name,Emp_Name,Age,Salary
Axempl,Rick,29,1000
Lastik,John,34,2000
Xenon,sidd,47,9000
Foxtrix,Ammy,thirty,2000
Hensaui,giny,33,ten
menuia,rony,fifty,7000
lopex,nick,23,Ninety
"""
df = pd.read_csv(StringIO(data))
print('Original dataframe\n', df)
df = df[(df.Age.apply(lambda x: x.isnumeric())) &
        (df.Salary.apply(lambda x: x.isnumeric()))]
print('Filtered dataframe\n', df)
gives
Original dataframe
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
3 Foxtrix Ammy thirty 2000
4 Hensaui giny 33 ten
5 menuia rony fifty 7000
6 lopex nick 23 Ninety
Filtered dataframe
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
I believe this can be solved using Pandas' "to_numeric" function.
import pandas as pd
df['Column to Check'] = pd.to_numeric(df['Column to Check'], downcast='integer', errors='coerce')
df.dropna(axis=0, inplace=True)
Where 'Column to Check' is the name of the column that you are checking for values that cannot be cast as an integer (or any numeric type); in your question I believe you will want to apply this code to 'Age' and 'Salary'. "to_numeric" will convert any values in those columns to NaN if they cannot be cast as your selected type. The "dropna" method will then remove all rows that have a NaN in any of your columns.
To loop over the columns like you ask, you could do the following:
for col in ['Age', 'Salary']:
    df[col] = pd.to_numeric(df[col], downcast='integer', errors='coerce')
df.dropna(axis=0, inplace=True)
EDIT:
In response to harry's comment: if there are preexisting NaNs in the data, something like the following should keep any valid row that had a preexisting NaN in one of the other columns.
for col in ['Age', 'Salary']:
    df[col] = pd.to_numeric(df[col], downcast='integer', errors='coerce')
    df = df[df[col].notnull()]
You can use a mask to indicate whether or not there is a string type among the Age and Salary columns:
mask_str = (df[['Age', 'Salary']]
            .applymap(lambda x: str(type(x)))
            .sum(axis=1)
            .str.contains("str"))
df[~mask_str]
This is assuming that the dataframe already contains the proper types. If not, you can convert them using the following:
def convert(val):
    try:
        return int(val)
    except ValueError:
        return val

df = (df.assign(Age=lambda f: f.Age.apply(convert),
                Salary=lambda f: f.Salary.apply(convert)))

Pandas read_csv adds unnecessary " " to each row

I have a csv file
(I am showing the first three rows here)
HEIGHT,WEIGHT,AGE,GENDER,SMOKES,ALCOHOL,EXERCISE,TRT,PULSE1,PULSE2,YEAR
173,57,18,2,2,1,2,2,86,88,93
179,58,19,2,2,1,2,1,82,150,93
I am using pandas read_csv to read the file and put them into columns.
Here is my code:
import pandas as pd
import os
path='~/Desktop/pulse.csv'
path=os.path.expanduser(path)
my_data=pd.read_csv(path, index_col=False, header=None, quoting = 3, delimiter=',')
print my_data
The problem is that the first and last columns have " before and after the values.
Additionally, I can't get rid of the indexes.
I might be making some silly mistake, but I thank you for your help in advance.
Final solution: use replace with conversion to ints, and to remove " from the column names use strip:
df = pd.read_csv('pulse.csv', quoting=3)
df = df.replace('"','', regex=True).astype(int)
df.columns = df.columns.str.strip('"')
print (df.head())
HEIGHT WEIGHT AGE GENDER SMOKES ALCOHOL EXERCISE TRT PULSE1 \
0 173 57 18 2 2 1 2 2 86
1 179 58 19 2 2 1 2 1 82
2 167 62 18 2 2 1 1 1 96
3 195 84 18 1 2 1 1 2 71
4 173 64 18 2 2 1 3 2 90
PULSE2 YEAR
0 88 93
1 150 93
2 176 93
3 73 93
4 88 93
index_col=False forces pandas not to use the first column as the index, but a dataframe always needs some index, so the default one (0, 1, 2, ...) is added. This parameter can therefore be omitted here.
header=None should be removed because it forces pandas not to read the first row (the CSV header) as the columns of the DataFrame. The header row is then treated as data, and the numeric values get converted to strings.
delimiter=',' should be removed too, because it is the same as sep=',', which is the default.
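A small illustration of the header=None point; the inline two-line sample below is an assumption standing in for pulse.csv, not the asker's actual file:

import pandas as pd
from io import StringIO

sample = "HEIGHT,WEIGHT\n173,57\n179,58\n"
# With header=None the header row becomes a data row, so both columns end up as object (strings).
print(pd.read_csv(StringIO(sample), header=None).dtypes)
# With the default header handling the first row becomes the column names and the values parse as int64.
print(pd.read_csv(StringIO(sample)).dtypes)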
#jezrael is right - a pandas dataframe will always have an index. It's necessary.
Try something like df[0] = df[0].str.strip(), replacing 0 with the last column.
Before you do so, convert your csv to a dataframe with pd.DataFrame.from_csv(path) (deprecated in later pandas versions in favor of pd.read_csv).
