Replace Matching Strings with New Text in Pandas Dataframe - python

I am trying to replace the names in the "Name" column with a generic ID in a new column, "research_code"; the "Name" column will then be removed.
I do not want to remove duplicates, but I do want all instances of "Buzz Lightyear" to be replaced by the same integer (i.e. 1), so all "Buzz Lightyear" rows become "1", all "Twighlight Sparkle" rows become "2", etc.
When I run this, I get no errors, but the "research_code" values do not persist for some reason.
full_set = pd.read_csv(filename, index_col=None, header=0)
grouped_set = full_set.groupby('Name')
names = grouped_set.groups.keys()
idx = 1
for c in names:
    set_index = str(idx + 1)
    idx = int(set_index) + 1
    replaceables = full_set[(full_set.Name == str(c))]
    for index, row in replaceables.iterrows():
        print(row['Name'])
        print(row['research_code'])
        row['research_code'] = set_index
        print(row['research_code'])
print(full_set.head)

Categories can be used.
import pandas as pd
import sys

if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

filename = StringIO("""Name
Rahul
Doug
Joe
Buzzlightyear
Twighlight Sparkle
Twighlight Sparkle
Liu
""")

full_set = pd.read_csv(filename, index_col=None, header=0)
full_set['research_code'] = full_set['Name'].astype('category')
full_set['research_code'] = full_set['research_code'].cat.rename_categories([i for i in range(full_set['research_code'].nunique())])
print(full_set.drop(['Name'], axis=1))
That last bit with the list comprehension is a bit gratuitous: just rename the categories by passing rename_categories() a list of new names (integers, in the question above) that is as long as the number of unique values in the Name column.
   research_code
0              4
1              1
2              2
3              0
4              5
5              5
6              3
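As an aside, the integer codes can be read off directly with .cat.codes, which skips the rename entirely (a minimal sketch, assuming the same full_set as above):
full_set['research_code'] = full_set['Name'].astype('category').cat.codes
print(full_set.drop(['Name'], axis=1))
This produces the same codes, since the categories are ordered alphabetically in both cases.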

Related

How to save values in pandas dataframe after editing some values

I have a dataframe which looks like this (it contains dummy data):
I want to remove the text which occurs after the "_________" identifier in each of the cells. I have written the code as follows (logic: add a new column containing NaN and save the edited values in that column):
import pandas as pd
import numpy as np

df = pd.read_excel(r'Desktop\Trial.xlsx')
NaN = np.nan
df["Body2"] = NaN
substring = "____________"
for index, row in df.iterrows():
    if substring in row["Body"]:
        split_string = row["Body"].split(substring, 1)
        row["Body2"] = split_string[0]
print(df)
But the Body2 column still displays NaN and not the edited values.
Any help would be much appreciated!
for index, row in df.iterrows():
    if substring in row["Body"]:
        split_string = row["Body"].split(substring, 1)
        # row["Body2"] = split_string[0]  # instead use the line below
        df.at[index, 'Body2'] = split_string[0]
Make use of at to modify the value: assigning to the row returned by iterrows() only writes to a copy, not to the original DataFrame.
Instead of iterating through the rows, do the operation on all rows at once. You can use expand=True to split the values into multiple columns, which I think is what you want.
substring = "____________"
df = pd.DataFrame({'Body': ['a____________b', 'c____________d', 'e____________f', 'gh']})
df[['Body1', 'Body2']] = df['Body'].str.split(substring, expand=True)
print(df)
# Body Body1 Body2
# 0 a____________b a b
# 1 c____________d c d
# 2 e____________f e f
# 3 gh gh None
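If only the text before the separator is wanted in Body2 (as in the original loop), a one-line sketch on the same df:
df['Body2'] = df['Body'].str.split(substring, n=1).str[0]
Rows without the separator keep their full text, since split returns a single-element list for them.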

.replace codes will not replace column with new column in python

I am trying to read a column in Python and create a new column from it.
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
I tried this, but it will not create a new column no matter what I do.
It seems like you made a silly mistake
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']}) # Why do you have this line?
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
Try removing the line with the comment. AFAIK, it replaces the DataFrame you just read, so the WT_RESIDUE column becomes a single empty string and your data is lost.
Considering a sample from the provided input:
We can use the map function to map the keys of the dict against the existing column and persist the corresponding values in a new column.
df = pd.DataFrame({
    'WT_RESIDUE': ['ALA', "REMARK", 'VAL', "LYS"]
})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df.WT_RESIDUE.map(codes)
Input
WT_RESIDUE
0 ALA
1 REMARK
2 VAL
3 LYS
Output
WT_RESIDUE MUTATION_CODE
0 ALA A
1 REMARK NaN
2 VAL V
3 LYS K
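Note the difference from .replace here: .map returns NaN for values missing from the dict (like REMARK above), while .replace leaves unmatched values untouched. If you would rather keep the original value when there is no code, a small sketch on the same df:
df['MUTATION_CODE'] = df['WT_RESIDUE'].map(codes).fillna(df['WT_RESIDUE'])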

Splitting text and numbers in dataframe in python

I have a dataframe df with a column named 'col' as the second column, and the data looks like this:
(dataframe screenshot)
I want to separate the text part into one column named "Casing Size" and the numerical part into another column named "DepthTo".
(desired output screenshot)
import pandas as pd
import io
from google.colab import files

uploaded = files.upload()
df = pd.read_excel(io.BytesIO(uploaded['Test-Checking.xlsx']))

# Method 1
df2 = pd.DataFrame(data=df, columns=['col'])
df2 = df2.col.str.extract('([a-zA-Z]+)([^a-zA-Z]+)', expand=True)
df2.columns = ['CasingSize', 'DepthTo']
df2

# Method 2
def split_col(x):
    try:
        numb = float(x.split()[0])
        txt = x.split()[1]
    except:
        numb = float(x.split()[1])
        txt = x.split()[0]
    x['col1'] = txt
    x['col2'] = numb

df2['col1'] = df.col.apply(split_col)
df2
I tried two methods but neither works correctly. Can anyone help me?
Code in Google Colab
Excel File Attached
Try this. First you need to return the values from your function; then you can unpack them into your columns using tolist():
def sample(x):
    b, y = x.split()
    return b, y

temp_df = df2['col'].apply(sample)
df2[['col1', 'col2']] = pd.DataFrame(temp_df.tolist())
You could try splitting the values into a list, then sorting them so that the numerical part comes first (digit characters sort before letters). Then you can apply pd.Series and assign back to the two columns.
import pandas as pd
df = pd.DataFrame({'col':["PWT 69.2", '283.5 HWT', '62.9 PWT', '284 HWT']})
df[['Casing Size','DepthTO']] = df['col'].str.split().apply(lambda x: sorted(x)).apply(pd.Series)
print(df)
Output
col Casing Size DepthTO
0 PWT 69.2 69.2 PWT
1 283.5 HWT 283.5 HWT
2 62.9 PWT 62.9 PWT
3 284 HWT 284 HWT
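A regex-based alternative that does not depend on sort order (a sketch on the same df; it assumes the text part is purely alphabetic and the number contains only digits and dots):
df['Casing Size'] = df['col'].str.extract(r'([A-Za-z]+)', expand=False)
df['DepthTO'] = df['col'].str.extract(r'([\d.]+)', expand=False)
This puts the text in "Casing Size" and the number in "DepthTO" regardless of token order, which matches the layout asked for in the question.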

Concatenating .mtx files and changing counter for cell IDs

I have several files that look like this, where the header is the count of unique values per column.
How can I read several of these files and concatenate them all into one?
When I concatenate, every value in the middle column needs to have the total count of that column from the previous files added to it, so the numbering continues across files. The other two columns I don't mind.
My try:
matrixFiles = glob.glob(filesPath + '/*matrix.mtx')
dfs = []
i = 0
for file in sorted(matrixFiles):
    matrix = pd.read_csv(file, sep=' ')
    cellNumber = matrix.columns[1]
    cellNumberInt = np.int64(cellNumber)
    if i > 0:
        matrix.iloc[:, 1] = matrix.iloc[:, 1] + cellNumberInt
    dfs.append(matrix)
    i = i + 1
big_file = pd.concat(dfs)
I don't know how to access the cellNumberInt from the previous iteration's file so I can add it to the new one.
Also, when I concat the dfs, the output is not a three-column dataframe. How can I concatenate all the files into the same columns while skipping the headers?
1.csv:
33694,1298,2465341
33665,1299,20
33663,1299,8
2.csv:
53694,1398,3465341
33665,1399,20
33663,1399,8
3.csv:
13694,7778,3465341
44432,7780,20
33663,7780,8
import pandas as pd
import numpy as np

matrixFiles = ['1.csv', '2.csv', '3.csv']
dfs = []
matrix_list = []
# this dict stores the i number (keys) and the cellNumberInt (values)
cellNumberInt_dict = {}
i = 0
for file in sorted(matrixFiles):
    matrix = pd.read_csv(file)
    cellNumber = matrix.columns[1]
    cellNumberInt = np.int64(cellNumber)
    cellNumberInt_dict[i] = cellNumberInt
    if i > 0:
        matrix.rename(columns={str(cellNumberInt): cellNumberInt + cellNumberInt_dict[i-1]}, inplace=True)
    dfs.append(matrix)
    if i < len(matrixFiles) - 1:
        # we only want to keep the df values here; keeping the columns that don't
        # have shared names messes up the pd.concat()
        matrix_list.append(matrix.values)
    i += 1

# get the last df in the dfs list because it has the last cellNumberInt
last_df = dfs[-1]
# concat all of the values from the dfs except for the last one
arrs = np.concatenate(matrix_list)
# make a df from the numpy arrays
new_df = pd.DataFrame(arrs, columns=last_df.columns.tolist())
big_file = pd.concat([last_df, new_df])
big_file.rename(columns={big_file.columns.tolist()[1]: sum(cellNumberInt_dict.values())}, inplace=True)
print(big_file)
   13694  10474  3465341
0  44432   7780       20
1  33663   7780        8
0  33665   1299       20
1  33663   1299        8
2  33665   1399       20
3  33663   1399        8
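A simpler sketch of the same idea keeps a running offset instead of renaming columns (it assumes space-separated .mtx-style files whose first line holds the per-file counts, as in the question, with filesPath as defined there; for the comma-separated samples above, drop sep=' '):
import glob
import pandas as pd

dfs = []
offset = 0
for f in sorted(glob.glob(filesPath + '/*matrix.mtx')):
    header = pd.read_csv(f, sep=' ', nrows=0)   # read the first line only: the counts
    n_cells = int(header.columns[1])            # unique cells in this file
    m = pd.read_csv(f, sep=' ', skiprows=1, header=None)
    m[1] = m[1] + offset                        # shift the cell IDs by the running total
    offset += n_cells
    dfs.append(m)
big_file = pd.concat(dfs, ignore_index=True)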

How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

I'm importing large amounts of http logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic so far has been to read the parsed lines into a DataFrame and then store the DataFrame into the HDFStore. My goal is to have a unique index for a single key in the store, but each DataFrame restarts its own index values. I was anticipating that HDFStore.append() would have some mechanism to tell it to ignore the DataFrame's index values and just keep adding to my HDFStore key's existing index values, but I cannot seem to find it. How do I import DataFrames, ignoring the index values they contain, while having the HDFStore increment its existing index values? The sample code below batches every 10 lines; the real thing would naturally be larger.
if hd_file_name:
    """
    HDF5 output file specified.
    """
    hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
    print hdf_output

    columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb', 'path', 'protocol', 'http_result',
               'response_size', 'referrer', 'user_agent', 'response_time']

    source_name = str(log_file.name.rsplit('/')[-1])  # HDF5 Tables don't play nice with unicode so explicit str(). :(

    batch = []
    for count, line in enumerate(log_file, 1):
        data = parse_line(line, rejected_output=reject_output)
        # Add our source file name to the beginning.
        data.insert(0, source_name)
        batch.append(data)
        if not (count % 10):
            df = pd.DataFrame(batch, columns=columns)
            hdf_output.append(KEY_NAME, df)
            batch = []
    if (count % 10):
        df = pd.DataFrame(batch, columns=columns)
        hdf_output.append(KEY_NAME, df)
You can do it like this. The only trick is that the first time through, the store's table doesn't exist yet, so get_storer will raise.
import pandas as pd
import numpy as np
import os

files = ['test1.csv', 'test2.csv']
for f in files:
    pd.DataFrame(np.random.randn(10, 2), columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

with pd.get_store(path) as store:
    for f in files:
        df = pd.read_csv(f, index_col=0)
        try:
            nrows = store.get_storer('foo').nrows
        except:
            nrows = 0
        df.index = pd.Series(df.index) + nrows
        store.append('foo', df)
In [10]: pd.read_hdf('test.h5','foo')
Out[10]:
A B
0 0.772017 0.153381
1 0.304131 0.368573
2 0.995465 0.799655
3 -0.326959 0.923280
4 -0.808376 0.449645
5 -1.336166 0.236968
6 -0.593523 -0.359080
7 -0.098482 0.037183
8 0.315627 -1.027162
9 -1.084545 -1.922288
10 0.412407 -0.270916
11 1.835381 -0.737411
12 -0.607571 0.507790
13 0.043509 -0.294086
14 -0.465210 0.880798
15 1.181344 0.354411
16 0.501892 -0.358361
17 0.633256 0.419397
18 0.932354 -0.603932
19 -0.341135 2.453220
You actually don't necessarily need a globally unique index (unless you want one), as HDFStore (through PyTables) provides one by uniquely numbering rows. You can always add these selection parameters:
In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
Out[11]:
A B
12 -0.607571 0.507790
13 0.043509 -0.294086
14 -0.465210 0.880798
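Note that pd.get_store was removed in later pandas versions; the same pattern works with pd.HDFStore used directly as a context manager (a sketch, otherwise identical to the answer above):
with pd.HDFStore(path) as store:
    for f in files:
        df = pd.read_csv(f, index_col=0)
        try:
            # table doesn't exist on the first pass, so fall back to 0
            nrows = store.get_storer('foo').nrows
        except (KeyError, AttributeError):
            nrows = 0
        df.index = pd.Series(df.index) + nrows
        store.append('foo', df)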
