I'm trying to split the dataframe header id;signin_count;status into separate columns that I can put my data into. I've tried df.columns.values, but I couldn't get a string to call .split on, as I was hoping. Instead, I got:
Index(['id;signin_count;status'], dtype='object')
which raises AttributeError: 'Index' object has no attribute 'split' when I call .split on it.
In broader terms, I have:
id;signin_count;status
0 353;20;done;
1 374;94;pending;
2 377;4;done;
And want:
id signin_count status
0 353 20 done
1 374 94 pending
2 377 4 done
Splitting the data itself is not the problem here; that I can do. The focus is on how to access the header names without hardcoding them, as I will have to do the same with any other dataset in the same format.
From the get-go, thank you
If you are reading your data from a CSV file, you can set sep to ';' and read it as:
df=pd.read_csv('filename.csv', sep=';', index_col=False)
Output:
id signin_count status
0 353 20 done
1 374 94 pending
2 377 4 done
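If the data is already in memory as a single column (so re-reading the file is not an option), here is a minimal sketch of the header-split approach the question asks about; filename.csv and the column layout are taken from the question:
import pandas as pd

df = pd.read_csv('filename.csv')      # everything lands in one column
new_cols = df.columns[0].split(';')   # ['id', 'signin_count', 'status']

# split each row on ';' and keep only as many parts as there are names
# (the trailing ';' in the data yields one extra, empty part)
out = df.iloc[:, 0].str.split(';', expand=True).iloc[:, :len(new_cols)]
out.columns = new_cols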
Related
I am reading all csv files in a folder (approx. 90 files). Each file has around 85 columns and I'm only interested in 2, so I'm trying to copy just those into my df. But the df I'm getting only shows the first two columns of each file.
My code:
import glob
import os
import pandas as pd

csv_files = glob.glob(os.path.join("C:/User/Documents/Auswertung/2/Vent_2022/*.csv"))
frames = [pd.read_csv(file, sep=';', low_memory=False, usecols=['LOCALTIME', 'Flow_filter'], names=['LOCALTIME', 'Flow_filter']) for file in csv_files]
df_vent = pd.concat(frames, ignore_index = True)
df_vent.drop([0,1,2], axis=0, inplace=True)
display(df_vent)
What I'm trying to get:
LOCALTIME             Flow_filter
01.07.2022 00:01:00   69
24.07.2022 22:46:00   167
09.08.2022 15:14:00   38
06.09.2022 18:45:00   51
What I'm getting:
LOCALTIME             Flow_filter
01.07.2022 00:01:00   01.07.2022 00:01:00
24.07.2022 22:46:00   24.07.2022 22:46:00
09.08.2022 15:14:00   09.08.2022 15:14:00
06.09.2022 18:45:00   06.09.2022 18:45:00
Does someone know why this is happening and how I can correct it? Thanks in advance :)
EDIT
I followed a suggestion of removing
names = ['LOCALTIME', 'Flow_filter']
but now the df's first column is a mixture of columns 1 and 3. Something like this:
LOCALTIME             Flow_filter
01.07.2022 00:01:00   69
24.07.2022 22:46:00   167
3                     38
3                     51
When you pass the names = ['LOCALTIME', 'Flow_filter'] option to pd.read_csv, you are overriding the header row in the file, effectively declaring that those are the names of the first two columns. usecols then selects those two names, which now refer to the first two columns of the file.
Since your file has a header row, simply removing that option will let pd.read_csv read the column names for you, and then usecols = ... should work as you expect.
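For completeness, a sketch of the corrected loop, reusing the path and column names from the question:
import glob
import pandas as pd

csv_files = glob.glob("C:/User/Documents/Auswertung/2/Vent_2022/*.csv")
frames = [pd.read_csv(file, sep=';', low_memory=False,
                      usecols=['LOCALTIME', 'Flow_filter'])
          for file in csv_files]
df_vent = pd.concat(frames, ignore_index=True)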
I have extracted user_id against shop_ids as a pandas dataframe from a database using a SQL query.
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1...
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
I am trying to write this dataframe into csv using:
df.to_csv('users_ordered_shops.csv')
I end up with the csv merging the shop ids into one number as such:
user_id shop_ids
0 22221205 541
1 23093087 508,844,604,460,446,000,000,000,000,000,000,000
2 23096023 2,053,205,320,532,050,000,000,000,000,000,000,000,000,000,000,000,000
3 23096446 43,394,339,396,643,300,000
4 23098684 50,043,604,500,457,400,000
The values for index 2 are:
print(df.iloc[2].shop_ids)
2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
Expected output is a csv file with all shop_ids intact in one column or different columns like:
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
Any tips on how to get the shop ids without merging when writing to a csv file? I have tried converting the shop_ids column using astype() to int and str which has resulted in the same output.
Update
To get one shop per column (and remove duplicates), you can use:
pd.concat([df['user_id'],
           df['shop_ids'].apply(lambda x: sorted(set(x.split(','))))
                         .apply(pd.Series)],
          axis=1).to_csv('users_ordered_shops.csv', index=False)
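As a design note, apply(pd.Series) is convenient but slow on large frames; str.split with expand=True does the same column expansion vectorized. A sketch (it skips the de-duplication step above):
shops = df['shop_ids'].str.split(',', expand=True)
pd.concat([df['user_id'], shops], axis=1).to_csv('users_ordered_shops.csv', index=False)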
Change the delimiter, so the commas inside shop_ids are no longer mistaken for field separators. Try:
df.to_csv('users_ordered_shops.csv', sep=';')
Or change the quoting strategy:
import csv
df.to_csv('users_ordered_shops.csv', quoting=csv.QUOTE_NONNUMERIC)
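Either way, a quick round-trip check (a sketch reusing the file name from the question) shows the ids survive as one string:
import csv
import pandas as pd

df.to_csv('users_ordered_shops.csv', quoting=csv.QUOTE_NONNUMERIC, index=False)
print(pd.read_csv('users_ordered_shops.csv').iloc[2]['shop_ids'])
# 2053,2053,2053,... (a single string, not merged)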
I have a wireless radio readout that basically dumps all of the data into one column (column 'A') of a spreadsheet (.xlsx). Is there any way to parse the twenty-plus columns into a dataframe for pandas? This is an example of the data that is in column A of the excel file:
DSP ALLMSINFO:SECTORID=0,CARRIERID=0;
Belgium351G
+++ HUAWEI 2020-04-03 10:04:47 DST
O&M #4421590
%%/*35687*/DSP ALLMSINFO:SECTORID=0,CARRIERID=0;%%
RETCODE = 0 Operation succeeded
Display Information of All MSs-
------------------------------
Sector ID Carrier ID MSID MSSTATUS MSPWR(dBm) DLCINR(dB) ULCINR(dB) DLRSSI(dBm) ULRSSI(dBm) DLFEC ULFEC DLREPETITIONFATCTOR ULREPETITIONFATCTOR DLMIMOFLAG BENUM NRTPSNUM RTPSNUM ERTPSNUM UGSNUM UL PER for an MS(0.001) NI Value of the Band Where an MS Is Located(dBm) DL Traffic Rate for an MS(byte/s) UL Traffic Rate for an MS(byte/s)
0 0 0011-4D10-FFBA Enter -2 29 27 -56 -107 21 20 0 0 MIMO B 2 0 0 0 0 0 -134 158000 46000
0 0 501F-F63B-FB3B Enter 13 27 28 -68 -107 21 20 0 0 MIMO A 2 0 0 0 0 0 -134 12 8
Basically I just want to parse this data and have the table in a dataframe. Any help would be greatly appreciated.
You could try pandas read_excel:
df = pd.read_excel(filename, skiprows=9)
This assumes we want to ignore the first 9 rows that don't make up the dataframe! Docs here https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
Load the excel file and split the column on the spaces.
A problem may occur with "DLMIMOFLAG" because it has a space in the data and this will cause it to be split over two columns. It's optional whether this is acceptable or if the columns are merged back together afterwards.
Add the header manually rather than load it, otherwise all the spaces in the header will confuse the loading & splitting routines.
import numpy as np
import pandas as pd
# Start on the first data row - row 10
# Make sure pandas knows that only data is being loaded by using
# header=None
df = pd.read_excel('radio.xlsx', skiprows=10, header=None)
This gives a dataframe that is only data, all held in one column.
To split these out, reference the first column with df.iloc[:,0], split it on whitespace with str.split(), and convert the result to a list of lists with values.tolist().
Together this looks like:
df2 = pd.DataFrame(df.iloc[:,0].str.split().values.tolist())
Note the example given has an extra column because of the space in "DLMIMOFLAG" causing it to be split over two columns. This will be referred to as "DLMIMOFLAG_A" and "DLMIMOFLAG_B".
Now add on the column headers.
Optionally create a list first.
column_names = ["Sector ID", "Carrier ID", "MSID", "MSSTATUS", "MSPWR(dBm)", "DLCINR(dB)", "ULCINR(dB)",
"DLRSSI(dBm)", "ULRSSI(dBm)", "DLFEC", "ULFEC", "DLREPETITIONFATCTOR", "ULREPETITIONFATCTOR",
"DLMIMOFLAG_A", "DLMIMOFLAG_B", "BENUM", "NRTPSNUM", "RTPSNUM", "ERTPSNUM", "UGSNUM",
"UL PER for an MS(0.001)", "NI Value of the Band Where an MS Is Located(dBm)",
"DL Traffic Rate for an MS(byte/s)", "UL Traffic Rate for an MS(byte/s)",]
df2.columns = column_names
This gives the output as a full dataframe with column headers.
Sector ID  Carrier ID  MSID            MSSTATUS  ...
0          0           0011-4D10-FFBA  Enter     ...
0          0           501F-F63B-FB3B  Enter     ...
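If the two "DLMIMOFLAG" halves should be merged back together (the optional step mentioned earlier), a short sketch using the column names defined above:
df2['DLMIMOFLAG'] = df2['DLMIMOFLAG_A'] + ' ' + df2['DLMIMOFLAG_B']
df2 = df2.drop(columns=['DLMIMOFLAG_A', 'DLMIMOFLAG_B'])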
There is an INDEX/MATCH formula in Excel that I use to check whether the elements are present in the required column:
=iferror(INDEX($B$2:$F$8,MATCH($J4,$B$2:$B$8,0),MATCH(K$3,$B$1:$F$1,0)),0)
This is the formula I am using right now and it is yielding good results, but I want to implement it in Python.
brand N Z None
Honor 63 96 190
Tecno 0 695 763
from this table I want
brand L N Z
Honor 0 63 96
Tecno 0 0 695
It should compare both the column and index and give the appropriate value
I have tried the lookup function in Python, but that gives me:
ValueError: Row labels must have same size as column labels
What you are basically doing with your Excel formula is creating something like a pivot table; you can do that with pandas too, e.g. like this:
# Define the columns and brands you would like to have in your result table;
# along with the dataframe in variable df, this is the only input
columns_query=['L', 'N', 'Z']
brands_query=['Honor', 'Tecno', 'Bar']
# now begin processing by selecting the columns
# which should be shown and are actually present
# add the brand, even if it was not selected
columns_present= {col for col in set(columns_query) if col in df.columns}
columns_present.add('brand')
# select the brands in question and take the
# info in columns we identified for these brands
# from this generate a "flat" list-like data
# structure using melt
# it contains records containing
# (brand, column-name and cell-value)
flat= df.loc[df['brand'].isin(brands_query), columns_present].melt(id_vars='brand')
# if you also want to see the columns and brands,
# for which you have no data in your original df
# you can use the following lines (if you don't
# need them, just skip the following lines until
# the next comment)
# the code just generates data points for the
# columns and rows, which would otherwise not be
# displayed and fills them with NaN (the pandas
# equivalent for None)
columns_missing= set(columns_query).difference(columns_present)
brands_missing= set(brands_query).difference(df['brand'].unique())
num_dummies= max(len(brands_missing), len(columns_missing))
dummy_records= {
'brand': list(brands_missing) + [brands_query[0]] * (num_dummies - len(brands_missing)),
'variable': list(columns_missing) + [columns_query[0]] * (num_dummies - len(columns_missing)),
'value': [np.nan] * num_dummies
}
dummy_records= pd.DataFrame(dummy_records)
flat= pd.concat([flat, dummy_records], axis='index', ignore_index=True)
# we get the result by the following line:
flat.set_index(['brand', 'variable']).unstack(level=-1)
For my testdata, this outputs:
value
variable L N Z
brand
Bar NaN NaN NaN
Honor NaN 63.0 96.0
Tecno NaN 0.0 695.0
The testdata is shown below (note that above we don't see column None or row Foo, but we do see row Bar and column L, which are not present in the testdata but were "queried"):
brand N Z None
0 Honor 63 96 190
1 Tecno 0 695 763
2 Foo 8 111 231
You can generate this testdata using:
import pandas as pd
import numpy as np
import io
raw=\
"""brand N Z None
Honor 63 96 190
Tecno 0 695 763
Foo 8 111 231"""
df= pd.read_csv(io.StringIO(raw), sep=r'\s+')
Note: the result as shown in the output is a regular pandas dataframe. So in case you plan to write the data back to an Excel sheet, there should be no problem (pandas provides methods to read/write dataframes to/from Excel files).
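For reference, if NaN-filled rows and columns for the missing entries are all you need, the same shape can be reached more compactly with reindex; a sketch on the same df and query lists as above:
result = (df.set_index('brand')
            .reindex(index=brands_query)
            .reindex(columns=columns_query))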
Do you need to use pandas for this? You can do it with plain Python as well: read from one text file and print out the matched and processed fields.
Basic file reading in Python goes like this, where datafile.csv is your file. It reads all the lines in the file and prints out the result. First you need to save your file in .csv format so there is a ',' separator between fields.
import csv # use csv
print('brand L N Z') # print new header
with open('datafile.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    next(spamreader, None) # skip old header
    for row in spamreader:
        # You need to add Excel MATCH etc... logic here.
        print(row[0], 0, row[1], row[2]) # print output
Input file:
brand,N,Z,None
Honor,63,96,190
Tecno,0,695,763
Prints out:
brand L N Z
Honor 0 63 96
Tecno 0 0 695
(I am not familiar with the Excel MATCH function, so you may need to add some logic to the above Python script to get it working with all your data; one possible shape of that logic is sketched below.)
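A sketch of that matching logic in plain Python: build a dict from the old header to each row's values, then emit the queried columns, defaulting to 0 when a column does not exist in the file (the column list 'L', 'N', 'Z' is taken from the question):
import csv

new_cols = ['L', 'N', 'Z']
with open('datafile.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    old_header = next(reader)                # ['brand', 'N', 'Z', 'None']
    print('brand', *new_cols)
    for row in reader:
        record = dict(zip(old_header, row))  # e.g. {'brand': 'Honor', 'N': '63', ...}
        print(record['brand'], *(record.get(col, 0) for col in new_cols))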
I have a CSV file created like this:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
Now I want the fourth row to get appended to the existing CSV file as followings:
First column: Remains same: 1213
Second column: Get max value: 898
Third column: Get min value: 009
Fourth column: Get avg value: 422.6
So the final CSV file should be:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
1213;898;009;422.6
Please help me to achieve the same. It's not mandatory to use Pandas.
Thanks in advance!
df.agg(...) accepts a dict where the keys are column names and the values are strings naming the aggregation you want for each column:
df_agg = df.agg({'keep_same': 'mode', 'get_max': 'max',
                 'get_min': 'min', 'get_avg': 'mean'})[df.columns]
Produces:
keep_same get_max get_min get_avg
0 1213 898 9 422.666667
Then you just append df_agg to df:
df = df.append(df_agg, ignore_index=False)
Result:
keep_same get_max get_min get_avg
0 1213 176 901 517.000000
1 1213 198 9 219.000000
2 1213 898 201 532.000000
0 1213 898 9 422.666667
Notice that the index of the appended row is 0. You can pass ignore_index=True to append if you desire.
Also note that if you plan to do this append operation a lot, it will be very slow. Other approaches do exist in that case but for once-off or just a few times, append is OK.
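Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions the same row append is spelled with pd.concat, e.g.:
df = pd.concat([df, df_agg], ignore_index=True)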
Assuming you do not care about the index, you can use loc[-1] to add the row:
df = pd.read_csv('file.csv', sep=';', dtype={'get_min':'object'}) # read csv set dtype to object for leading 0 col
row = [df['keep_same'].values[0], df['get_max'].max(), df['get_min'].min(), df['get_avg'].mean()] # create new row
df.loc[-1] = row # add row to a new line
df['get_avg'] = df['get_avg'].round(1) # round to 1
df['get_avg'] = df['get_avg'].apply(lambda x: '%g'%(x)) # strip .0 from the other records
df.to_csv('file1.csv', index=False, sep=';') # to csv file
out:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
1213;898;009;422.7
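Since the question says pandas is not mandatory, here is a pandas-free sketch with the csv module; file.csv is assumed to hold the ';'-separated data from the question:
import csv

with open('file.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    header = next(reader)
    rows = list(reader)

new_row = [
    rows[0][0],                                            # keep_same: first value
    max(int(r[1]) for r in rows),                          # get_max
    min(r[2] for r in rows),                               # get_min as text; string min works here since all values have equal width, and it preserves '009'
    round(sum(float(r[3]) for r in rows) / len(rows), 1),  # get_avg
]

with open('file.csv', 'a', newline='') as f:
    csv.writer(f, delimiter=';').writerow(new_row)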