Concat Columns of Dataframe in python?

I have a data frame generated with the code below:
# importing pandas as pd
import pandas as pd
# Create the dataframe
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'D'],
                   'Event': ['Music Theater', 'Poetry Music', 'Theatre Comedy', 'Comedy Theatre'],
                   'Cost': [10000, 5000, 15000, 2000]})
# Print the dataframe
print(df)
I want a list to be generated combining all three columns, replacing whitespace with "_" and removing all trailing spaces too, like:
[A_Music_Theater_10000, B_Poetry_Music_5000, C_Theatre_Comedy_15000, D_Comedy_Theatre_2000]
I want to do it in the most optimized way, as running time is an issue for me, so I am looking to avoid for loops. Can anybody tell me how I can achieve this in the most optimized way?

The most general solution is to convert all values to strings, join them, and finally replace the spaces:
df['new'] = df.astype(str).apply('_'.join, axis=1).str.replace(' ', '_')
If you need to use only some columns:
cols = ['Category','Event','Cost']
df['new'] = df[cols].astype(str).apply('_'.join, axis=1).str.replace(' ', '_')
Or process each column separately: replace spaces where necessary and also convert the numeric column to strings:
df['new'] = (df['Category'] + '_' +
             df['Event'].str.replace(' ', '_') + '_' +
             df['Cost'].astype(str))
Or, after converting to strings, add '_' and sum, but then it is necessary to remove the trailing '_' with rstrip after the replace:
df['new'] = df.astype(str).add('_').sum(axis=1).str.replace(' ', '_').str.rstrip('_')
print(df)
  Category           Event   Cost                      new
0        A   Music Theater  10000    A_Music_Theater_10000
1        B    Poetry Music   5000      B_Poetry_Music_5000
2        C  Theatre Comedy  15000  C_Theatre_Comedy_15000
3        D  Comedy Theatre   2000   D_Comedy_Theatre_2000
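Since the question asks for a list rather than a new column, the combined column can be converted at the end (a small addition to the answer above):
result = df['new'].tolist()
print(result)
# ['A_Music_Theater_10000', 'B_Poetry_Music_5000', 'C_Theatre_Comedy_15000', 'D_Comedy_Theatre_2000']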

Related

Optimizing an Excel to Pandas import and transformation from wide to long data

I need to import and transform xlsx files. They are written in a wide format and I need to reproduce some of the cell information from each row and pair it up with information from all the other rows:
[Edit: changed format to represent the more complex requirements]
Source format

ID  Property  Activity1name  Activity1timestamp  Activity2name  Activity2timestamp
1   A         a              1.1.22 00:00        b              2.1.22 10:05
2   B         a              1.1.22 03:00        b              5.1.22 20:16
Target format

ID  Property  Activity  Timestamp
1   A         a         1.1.22 00:00
1   A         b         2.1.22 10:05
2   B         a         1.1.22 03:00
2   B         b         5.1.22 20:16
The following code works fine to transform the data, but the process is really, really slow:
def transform(data_in):
    data = pd.DataFrame(columns=columns)
    # Determine number of processes entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - (len(columns) - 2)) / len(process_matching) + 1)
    data_in = data_in.to_dict("records")  # Convert to dict for speed optimization
    for row_dict in tqdm(data_in):  # Iterate over each row of the original file
        new_row = {}
        # Set common columns for each process step
        for column in column_matching:
            new_row[column] = row_dict[column_matching[column]]
        for step in range(0, steps_per_row):
            rep = str(step+1) if step > 0 else ""
            # Iterate for as many times as there are process steps in one row of the original file and
            # set specific columns for each process step, keeping common column values identical for current row
            for column in process_matching:
                new_row[column] = row_dict[process_matching[column]+rep]
            data = data.append(new_row, ignore_index=True)  # append dict of new_row to existing data
    data.index.name = "SortKey"
    data[timestamp].replace(r'.000', '', regex=True, inplace=True)  # Remove trailing zeros from timestamp  # TODO check if works as intended
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True)  # Replace cells with only spaces with NaN
    data.dropna(axis=0, how="all", inplace=True)  # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True)  # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True)  # Drop rows with empty Timestamp
    data.fillna('', inplace=True)  # Replace NaN values with empty cells
    return data
Obviously, iterating over each row and then even each column is not at all how to use pandas the right way, but I don't see how this kind of transformation can be vectorized.
I have tried using parallelization (modin) and played around with using dict or not, but it didn't work / help. The rest of the script literally just opens and saves the files, so the problem lies here.
I would be very grateful for any ideas on how to improve the speed!
The df.melt function should be able to do this type of operation much faster.
df = pd.DataFrame({'ID': [1, 2],
                   'Property': ['A', 'B'],
                   'Info1': ['x', 'a'],
                   'Info2': ['y', 'b'],
                   'Info3': ['z', 'c'],
                   })
data = df.melt(id_vars=['ID', 'Property'], value_vars=['Info1', 'Info2', 'Info3'])
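For reference, the melted frame from this small example looks like this (output reconstructed for illustration, not copied from the original answer):
print(data)
   ID Property variable value
0   1        A    Info1     x
1   2        B    Info1     a
2   1        A    Info2     y
3   2        B    Info2     b
4   1        A    Info3     z
5   2        B    Info3     c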
Edit to address the modified question:
Combine the df.melt with a df.pivot_table operation.
# imports used below
import numpy as np
import pandas as pd

# create data
df = pd.DataFrame({'ID': [1, 2, 3],
                   'Property': ['A', 'B', 'C'],
                   'Activity1name': ['a', 'a', 'a'],
                   'Activity1timestamp': ['1_1_22', '1_1_23', '1_1_24'],
                   'Activity2name': ['b', 'b', 'b'],
                   'Activity2timestamp': ['2_1_22', '2_1_23', '2_1_24'],
                   })
# melt dataframe
df_melted = df.melt(id_vars=['ID', 'Property'],
                    value_vars=['Activity1name', 'Activity1timestamp',
                                'Activity2name', 'Activity2timestamp'],
                    )
# merge categories, i.e. Activity1name and Activity2name become Activity
df_melted.loc[df_melted['variable'].str.contains('name'), 'variable'] = 'Activity'
df_melted.loc[df_melted['variable'].str.contains('timestamp'), 'variable'] = 'Timestamp'
# add category ids (dataframe may need to be sorted before this operation)
u_category_ids = np.arange(1, len(df_melted.variable.unique()) + 1)
category_ids = np.repeat(u_category_ids, len(df) * 2).astype(str)
df_melted.insert(0, 'unique_id', df_melted['ID'].astype(str) + '_' + category_ids)
# pivot table
table = df_melted.pivot_table(index=['unique_id', 'ID', 'Property'],
                              columns='variable', values='value',
                              aggfunc=lambda x: ' '.join(x))
table = table.reset_index().drop(['unique_id'], axis=1)
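On this sample data, the pivoted result should look roughly like this (output reconstructed, not from the original answer; exact formatting may vary by pandas version):
print(table)
variable  ID Property Activity Timestamp
0          1        A        a    1_1_22
1          1        A        b    2_1_22
2          2        B        a    1_1_23
3          2        B        b    2_1_23
4          3        C        a    1_1_24
5          3        C        b    2_1_24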
Using pd.melt, as suggested by @Pantelis, I was able to speed up this transformation so much, it's unbelievable. Before, a file with ~13k rows took 4-5 hours on a brand-new ThinkPad X1; now it takes less than 2 minutes! That's a speed-up by a factor of 150, just wow. :)
Here's my new code, for inspiration / reference if anyone has a similar data structure:
def transform(data_in):
    # Determine number of processes entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - len(column_matching)) / len(process_matching))
    # Specify columns for pd.melt, transforming wide data format to long format
    id_columns = column_matching.values()
    var_names = {"Erledigungstermin Auftragsschrittbeschreibung": data_in["Auftragsschrittbeschreibung"].replace(" ", np.nan).dropna().values[0]}
    var_columns = ["Erledigungstermin Auftragsschrittbeschreibung"]
    for _ in range(2, steps_per_row + 1):
        try:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = data_in["Auftragsschrittbeschreibung" + str(_)].replace(" ", np.nan).dropna().values[0]
        except IndexError:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = data_in.loc[0, "Auftragsschrittbeschreibung" + str(_)]
        var_columns.append("Erledigungstermin Auftragsschrittbeschreibung" + str(_))
    data = pd.melt(data_in, id_vars=id_columns, value_vars=var_columns, var_name="ActivityName", value_name=timestamp)
    data.replace(var_names, inplace=True)  # Replace "Erledigungstermin Auftragsschrittbeschreibung" with ActivityName
    data.sort_values(["Auftrags-\npositionsnummer", timestamp], ascending=True, inplace=True)
    # Improve column names
    data.index.name = "SortKey"
    column_names = {v: k for k, v in column_matching.items()}
    data.rename(mapper=column_names, axis="columns", inplace=True)
    data[timestamp].replace(r'.000', '', regex=True, inplace=True)  # Remove trailing zeros from timestamp
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True)  # Replace cells with only spaces with NaN
    data.dropna(axis=0, how="all", inplace=True)  # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True)  # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True)  # Drop rows with empty Timestamp
    data.fillna('', inplace=True)  # Replace NaN values with empty cells
    return data

Rename columns in dataframe using bespoke function python pandas

I've got a data frame with column names like 'AH_AP' and 'AH_AS'.
Essentially, all I want to do is swap the part before the underscore and the part after the underscore so that the column headers are 'AP_AH' and 'AS_AH'.
I can do that if the elements are in a list, but I've no idea how to get that to apply to column names.
My solution if it were a list goes like this:
columns = ['AH_AP', 'AS_AS']

def rejig_col_names():
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
I'm guessing I need to apply this to something like the below, but I've no idea how, or how to reference a single column within df.columns:
df.columns = df.columns.map()
Any help appreciated. Thanks :)
You can do it this way:
Input:
df = pd.DataFrame(data=[['1','2'], ['3','4']], columns=['AH_PH', 'AH_AS'])
print(df)
  AH_PH AH_AS
0     1     2
1     3     4
Output:
df.columns = df.columns.str.split('_').str[::-1].str.join('_')
print(df)
  PH_AH AS_AH
0     1     2
1     3     4
Explained:
Use the string accessor and the split method on '_'.
Then, using the str accessor with index slicing [::-1], you can reverse the order of the list.
Lastly, using the string accessor and join, we can concatenate the list back together again.
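To see the intermediate steps on a single column name (a small illustration added here, not part of the original answer):
name = 'AH_PH'
parts = name.split('_')           # ['AH', 'PH']
reversed_parts = parts[::-1]      # ['PH', 'AH']
print('_'.join(reversed_parts))   # PH_AH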
You were almost there: you can do
df.columns = df.columns.map(rejig_col_names)
except that the function gets called with a column name as argument, so change it like this:
def rejig_col_names(col_name):
    elements_of_header = col_name.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
An alternative to the other answer, using your function and DataFrame.rename:
import pandas as pd

def rejig_col_names(columns):
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title

data = {
    'A_B': [1, 2, 3],
    'C_D': [4, 5, 6],
}
df = pd.DataFrame(data)
df.rename(rejig_col_names, axis='columns', inplace=True)
print(df)
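With the sample data above, the renamed frame prints as (output reconstructed for reference):
   B_A  D_C
0    1    4
1    2    5
2    3    6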
str.replace is also an option via swapping capture groups:
Sample input borrowed from ScottBoston
df = pd.DataFrame(data=[['1', '2'], ['3', '4']], columns=['AH_PH', 'AH_AS'])
Then capture everything before and after the '_' and swap capture groups 1 and 2.
df.columns = df.columns.str.replace(r'^(.*)_(.*)$', r'\2_\1', regex=True)
  PH_AH AS_AH
0     1     2
1     3     4

How to combine DataFrame column data and a fixed text string

I want to combine 4 columns within a larger DataFrame with a custom (space) delimiter (which I have done with the code below) but then I want to add a fixed string to the start and end of each concatenation.
The columns are pairs of X & Y coordinates, but they can be dealt with as str for this purpose (once I've trimmed to 3 decimal places).
I have found many options on this website for joining the columns, but none to join columns and a consistent fixed string. The lazy way would be for me to just make two more DataFrame columns, one for the start, one for the end, and cat everything. Is there a more sophisticated way to do it?
import pandas as pd
from pandas import DataFrame
import numpy as np

def str_join(df, sep, *cols):
    from functools import reduce
    return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep),
                  [df[col] for col in cols])

data = pd.read_csv('/Users/XXXXXX/Desktop/Lines.csv')
df = pd.DataFrame(data, columns=['Name', 'SOLE', 'SOLN', 'EOLE', 'EOLN', 'EOLKP', 'Wind', 'Wave'])
df['SOLE'] = round(df['SOLE'], 3)
df['SOLN'] = round(df['SOLN'], 3)
df['EOLE'] = round(df['EOLE'], 3)
df['EOLN'] = round(df['EOLN'], 3)
df['WKT'] = str_join(df, ' ', 'SOLE', 'SOLN', 'EOLE', 'EOLN')
df.to_csv('OutLine.csv')  # turn on to create output file
which gives me:
WKT
476912.131 6670122.285 470329.949 6676260.271
What I want to do is add '(LINESTRING ' to the start of each concatenation and ')' to the end of each, to give me:
WKT
(LINESTRING 476912.131 6670122.285 470329.949 6676260.271 )
You could also create a collection of the columns you want to export, do a quick data type format, and apply a join.
target_cols = ['SOLE','SOLN','EOLE','EOLN',]
# Make sure to use along axis 1 (columns) because default is 0
# Also, if you're on Python 3.6+, I think you can use f-strings to format your floats.
df['WKT'] = df[target_cols].apply(lambda x: '(LINESTRING ' + ' '.join(f"{i:.3f}" for i in x) + ')', axis=1)
result:
In [0]: df.iloc[:, -3:]
Out[0]:
    Wind   Wave                                                WKT
0  wind1  wave1  (LINESTRING 476912.131 6670122.285 470329.949 ...
Sorry, I'm using Spyder, which is a terminal output miser. Here's a printout of 'WKT':
In [1]: print(df['WKT'].values)
Out [1]: ['(LINESTRING 476912.131 6670122.285 470329.949 6676260.271)']
EDIT: To add a comma after 'SOLN', we could use an alternative route:
target_cols = ['SOLE', 'SOLN', 'EOLE', 'EOLN',]
# Format strings in advance
# Set comma_col to our desired column name. This could also be a tuple for multiple names, then replace `==` with `in` in the loop below.
comma_col = 'SOLN'
# To find the last column, which doesn't need a space here, we just select the last value from our list. I did it this way in case our list order doesn't match the dataframe order.
last_col = df[target_cols].columns.values.tolist()[-1]
# Traditional if-then method
for col in df[target_cols]:
    if col == comma_col:
        df[col] = df[col].apply(lambda x: f"{x:.3f}" + ",")  # Explicit comma
    elif col == last_col:
        df[col] = df[col].apply(lambda x: f"{x:.3f}")
    else:
        df[col] = df[col].apply(lambda x: f"{x:.3f}" + " ")  # Explicit whitespace
# Adding our 'WKT' column as before, but the .join() portion doesn't have a space in it now.
df['WKT'] = df[target_cols].apply(lambda x: '(LINESTRING ' + ''.join(i for i in x) + ')', axis=1)
Finally:
In [0]: print(df['WKT'][0])
Out [0]: (LINESTRING 476912.131 6670122.286,470329.950 6676260.271)
Your function already looks good; you just need to add a few things:
def str_join(df, sep, *cols):
    # All cols must be numeric to use df[col].round(3)
    from functools import reduce
    joined = reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep),
                    [df[col].round(3) for col in cols])
    # Add the fixed prefix and suffix once, after all columns are joined
    return '(LINESTRING ' + joined + ' )'
or use it this way:
df['new'] = '(LINESTRING '
df['WKT'] = (df['new'] + df['SOLE'].astype(str) + ' ' + df['SOLN'].astype(str) + ' ' +
             df['EOLE'].astype(str) + ' ' + df['EOLN'].astype(str) + ' )')

Extracting the integers from a column of strings

I have 2 dataframes: longdf and shortdf. Longdf is the 'master' list, and I basically need to match values from shortdf to longdf and, for those that match, replace values in other columns. Both longdf and shortdf need extensive data cleaning.
The goal is to reach the df 'goal.' I was trying to use a for loop where I wanted to 1) extract all numbers in the df cell, and 2) strip the blank/white spaces from the cell. First: How come this for loop doesn't work? Second: Is there a better way to do this?
import pandas as pd

a = pd.Series(['EY', 'BAIN', 'KPMG', 'EY'])
b = pd.Series([' 10wow this is terrible data8 ', '10/ USED TO BE ANOTHER NUMBER/ 2', ' OMG 106 OMG ', ' 10?7'])
y = pd.Series(['BAIN', 'KPMG', 'EY', 'EY'])
z = pd.Series([108, 102, 106, 107])

goal = pd.DataFrame
shortdf = pd.DataFrame({'consultant': a, 'invoice_number': b})
longdf = shortdf.copy(deep=True)
goal = pd.DataFrame({'consultant': y, 'invoice_number': z})

shortinvoice = shortdf['invoice_number']
longinvoice = longdf['invoice_number']
frames = [shortinvoice, longinvoice]
new_list = []
for eachitemer in frames:
    eachitemer.str.extract('(\d+)').astype(float)  # extracting all numbers in the df cell
    eachitemer.str.strip()  # strip the blank/whitespaces in between the numbers
    new_list.append(eachitemer)
new_short_df = new_list[0]
new_long_df = new_list[1]
If I understand correctly, you want to take a series of strings that contain integers and remove all the characters that aren't integers. You don't need a for-loop for this. Instead, you can solve it with a simple regular expression.
b.replace('\D+', '', regex=True).astype(int)
Returns:
0    108
1    102
2    106
3    107
The regex replaces all characters that aren't numbers (denoted by \D) with an empty string, removing anything that's not a number. .astype(int) converts the series to the integer type. You can merge the result into your final dataframe as normal:
result = pd.DataFrame({
    'consultant': a,
    'invoice_number': b.replace('\D+', '', regex=True).astype(int)
})
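If the next step is to match the cleaned invoice numbers between shortdf and longdf, as described in the question, a merge is one option. This is only a sketch, assuming invoice_number is the matching key:
# Clean both frames the same way, then align matching rows on invoice_number
shortdf['invoice_number'] = shortdf['invoice_number'].replace(r'\D+', '', regex=True).astype(int)
longdf['invoice_number'] = longdf['invoice_number'].replace(r'\D+', '', regex=True).astype(int)
matched = longdf.merge(shortdf, on='invoice_number', how='left', suffixes=('', '_short'))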

Values in Pandas dataframe are mixed and shifted

Values in my Pandas dataframe are mixed and shifted, but each column has its own characteristics for the values in it. How can I rearrange the values into their proper columns?
'floor_no' have to contain values with ' / ' substring in it.
'room_count' is maximum 2 values digit long.
sq_m_count' have to contain ' m²' substring in it.
'price_sq' have to contain ' USD/m²' in it.
'bs_state' have to contain one of 'Have' or 'Do not have' values.
(Part of the pandas dataframe was attached as an image in the original question.)
Consider the following approach:
In [90]: dfs = []
In [91]: url = 'https://ru.bina.az/items/565674'
In [92]: dfs.append(pd.read_html(url)[0].set_index(0).T)
In [93]: url = 'https://ru.bina.az/items/551883'
In [94]: dfs.append(pd.read_html(url)[0].set_index(0).T)
In [95]: df = pd.concat(dfs, ignore_index=True)
In [96]: df
Out[96]:
0 Категория Площадь Количество комнат Купчая
0 Дом / Вилла 376 м² 6 есть
1 Дом / Вилла 605 м² 6 нет
I figured out a solution that is a bit hacky.
I wrote a loop that checks whether each of these columns contains some string identifying the column it belongs to, and copies that value into a new column. Then I simply substituted the new column for the old one.
I did this with each of the 'mixed' columns. This code met my needs and fixed the problem. I understand how crude the code is and will write a function that is much shorter and more professional.
# NOTE: the .loc assignments below are already vectorized, so the outer loop is redundant
for index in bina_az_df.itertuples():
    bina_az_df.loc[bina_az_df['bs_state'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['bs_state']
    bina_az_df.loc[bina_az_df['sq_m_count'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['sq_m_count']
    bina_az_df.loc[bina_az_df['floor_no'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['floor_no']
    bina_az_df.loc[bina_az_df['price_sq'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['price_sq']
    bina_az_df.loc[bina_az_df['room_count'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['room_count']

bina_az_df['sq_m_count'] = bina_az_df['new_sq_m_count']  # Substitute the old column with the new one
del bina_az_df['new_sq_m_count']  # Delete the unnecessary temp column
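A shorter, more general version of the same idea, sketched as a function (hypothetical: it assumes the column names and identifying patterns listed in the question):
import pandas as pd

# Patterns that identify which target column a value belongs to (taken from the question)
patterns = {
    'floor_no': r' / ',
    'sq_m_count': r' m²|sot',
    'price_sq': r' USD/m²',
    'bs_state': r'Have|Do not have',
}

def realign(df, patterns):
    fixed = df.copy()
    mixed_cols = list(patterns) + ['room_count']
    for target, pat in patterns.items():
        for col in mixed_cols:
            # Wherever a value in `col` matches the pattern for `target`, move it there
            mask = df[col].astype(str).str.contains(pat, regex=True, na=False)
            fixed.loc[mask, target] = df.loc[mask, col]
    return fixed

# bina_az_df = realign(bina_az_df, patterns)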
