I have an Excel file that I read via:
df = pd.read_excel(path)
The problem is the encoding of the file when read with pandas (when opened in Excel, everything is fine):
df
id group
_x0034_5109336 _x0020_N12
_x0035_4610785 _x0020_N32
_x0036_1987159 _x0020_N33
_x0034_6506844 _x0020_N41_x0020__x002F__x0020_N42
_x0033_8342845 _x0020_N23
I wanted to remove the characters manually:
df[col] = df[col].astype(str).str.replace('_x0020_', ' ')
But it might not be the best option.
Below is the expected output:
df
id group
45109336 N12
54610785 N32
61987159 N33
46506844 N41 / N42
38342845 N23
You'd have to use a regex, something like _x[0-9A-Fa-f]{4}_ (a greedy _x.*[0-9]_ would swallow everything from the first escape to the last one in a row like _x0020_N41_x0020__x002F__x0020_N42):
df[col] = df[col].astype(str).str.replace(r'_x[0-9A-Fa-f]{4}_', '', regex=True)
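Note that stripping the escapes outright loses the characters they encode: _x0020_ is a space and _x002F_ is /, so _x0020_N41_x0020__x002F__x0020_N42 would come out as N41N42 rather than the expected N41 / N42. A minimal sketch that decodes each _xHHHH_ escape instead, assuming every escape follows that four-hex-digit pattern:

import pandas as pd

def decode_escapes(s):
    # replace each _xHHHH_ escape with the character it encodes,
    # then trim the spaces left by leading/trailing _x0020_
    return (s.astype(str)
             .str.replace(r'_x([0-9A-Fa-f]{4})_',
                          lambda m: chr(int(m.group(1), 16)),
                          regex=True)
             .str.strip())

for col in ['id', 'group']:
    df[col] = decode_escapes(df[col])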
For some reason, none of the solutions previously posted about this seem to answer my question.
I am reading in an Excel workbook with 150+ sheets. I am looping through them and preparing the data to be concatenated together (doing things like deleting unneeded/blank columns and transforming some data). However, for some reason I cannot get rid of any of the newline characters, no matter what I try. Here are some variations I've tried so you can see what DIDN'T work.
import pandas as pd
import os
os.chdir(r'C:\Users\agray\Downloads')
sheets_dict = pd.read_excel('2022_Advanced_Control.xlsx', sheet_name=None)
df_list = list(sheets_dict.values())
df_list_clean = []
The top part stays the same; this loop portion is what changes.
for df in df_list:
    df.columns = [c.replace(' ', '_') for c in df.columns]
    df.drop(df.columns.difference(['Prescription_Drug_Name', 'Drug_Tier', 'Drug_Notes']), 1, inplace=True)
    df.drop(df.tail(3).index, inplace=True)
    df.loc[:, 'Prescription_Drug_Name'] = df.loc[:, 'Prescription_Drug_Name'].replace("\n", "", inplace=True)
    df_list_clean.append(df)
This gives me a column that has nothing but blank values.
Here's another way I tried:
for df in df_list:
    df.columns = [c.replace(' ', '_') for c in df.columns]
    df.drop(df.columns.difference(['Prescription_Drug_Name', 'Drug_Tier', 'Drug_Notes']), 1, inplace=True)
    df.drop(df.tail(3).index, inplace=True)
    df['Prescription_Drug_Name'] = df['Prescription_Drug_Name'].replace(r'\n', '', regex=True, inplace=True)
    df_list_clean.append(df)
This version is only applying to a copy, so none of the changes it says it's making are actually being made to my df. Any ideas how to get rid of all these "\n" characters in my column? Thanks!
Use str.replace(). Note that in recent pandas versions Series.str.replace defaults to regex=False, so pass regex=True explicitly for a regex pattern:
df['Prescription_Drug_Name'] = df['Prescription_Drug_Name'].str.replace(r'\n', '', regex=True)
I always advise against inplace=True. Make an explicit copy where you mean to.
"This version is only applying to a copy, so none of the changes... being made to my df." Why don't you clone your data like this:
for df in df_list:
    clean = df.copy()
    clean.columns = [c.replace(' ', '_') for c in clean.columns]
    clean = clean.drop(clean.columns.difference(['Prescription_Drug_Name', 'Drug_Tier', 'Drug_Notes']), axis=1)
    # drop the last three rows
    clean = clean.iloc[:-3]
    # modify the column; no `inplace` here
    clean['Prescription_Drug_Name'] = clean['Prescription_Drug_Name'].replace(r'\n', '', regex=True)
    df_list_clean.append(clean)
That being said, all of the above can be chained, so you can do something like this:
for df in df_list:
    clean = (df.rename(columns=lambda x: x.replace(' ', '_'))
               .reindex(['Prescription_Drug_Name', 'Drug_Tier', 'Drug_Notes'], axis=1)  # select only these columns
               .dropna(axis=0, how='all')
               .iloc[:-3]
               .assign(Prescription_Drug_Name=lambda x: x['Prescription_Drug_Name'].replace(r'\n', '', regex=True)))
    df_list_clean.append(clean)
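With the loop done, the cleaned frames can be concatenated, which is the stated end goal; a one-line sketch using the df_list_clean list from above:

import pandas as pd

# stack all cleaned sheets into a single frame
combined = pd.concat(df_list_clean, ignore_index=True)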
Hi there, I am trying to figure out how to replace specific data in a CSV file. I have a file which holds the base/location data for IDs.
https://store8.gofile.io/download/5b031959-e0b0-4dbf-aec6-264e0b87fd09/service%20block.xlsx (sheet 2 has the data).
The file in which I want to replace data using the id is below:
https://store8.gofile.io/download/6e13a19a-bac8-4d16-8692-e4435eed2a08/Serp.csv
The highlighted part needs to be deleted after filling in the location.
import pandas as pd

df1 = pd.read_excel("serp.xlsx", header=None)
df2 = pd.read_excel("flocnam.xlsx", header=None)

# split the single ";"-delimited column into separate columns
df1 = df1[0].str.split(";", expand=True)
# keep only the last whitespace-separated token of column 4
df1[4] = df1[4].apply(lambda x: v[-1] if (v := x.split()) else "")
# keep only the part before the "-" in the lookup file's column 1
df2[1] = df2[1].apply(lambda x: x.split("-")[0])
# build a column-1 -> column-0 mapping and substitute it into column 4
m = dict(zip(df2[1], df2[0]))
df1[4] = df1[4].replace(m)
print(df1)
df1.to_csv("test.csv")
It worked, but not how I wanted:
https://store8.gofile.io/download/c0ae7e05-c0e2-4f43-9d13-da12ddf73a8d/test.csv
I am trying to replace it like this (desired output).
Thank you for being a supportive community ❤️
If I understand correctly, you simply need to specify the ; separator:
>>> df.to_csv('test.csv', sep=';', index_label=False)
I want to execute a regexp match on a dataframe column in order to modify the content of the column.
For example, given this dataframe:
import pandas as pd
df = pd.DataFrame([['abra'], ['charmender'], ['goku']],
                  columns=['Name'])
print(df.head())
I want to execute the following regex match:
CASE
WHEN REGEXP_MATCH(Landing Page,'abra') THEN "kadabra"
WHEN REGEXP_MATCH(Landing Page,'charmender') THEN "charmaleon"
ELSE "Unknown" END
My solution is the following:
df.loc[df['Name'].str.contains("abra", na=False), 'Name'] = "kadabra"
df.loc[df['Name'].str.contains("charmender", na=False), 'Name'] = "charmeleon"
df.head()
It works but I do not know if there is a better way of doing it.
Moreover, I have to rewrite all the regex cases line by line in Python. Is there a way to execute the regex directly in Pandas?
Are you looking for map?
df['Name'] = df['Name'].map({'abra':'kadabra','charmender':'charmeleon'})
Output:
Name
0 kadabra
1 charmeleon
2 NaN
Update: For partial matches:
df = pd.DataFrame([['this abra'], ['charmender'], ['goku']],
                  columns=['Name'])
replaces = {'abra':'kadabra','charmender':'charmeleon'}
df['Name'] = df['Name'].str.extract(fr"\b({'|'.join(replaces.keys())})\b")[0].map(replaces)
And you get the same output (with a different dataframe).
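If you also need the ELSE "Unknown" branch of the original CASE expression, one option is numpy.select, which tries a list of boolean conditions in order and falls back to a default. A sketch built on the question's data (np.select is standard NumPy; the patterns and replacements come from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame([['abra'], ['charmender'], ['goku']], columns=['Name'])

conditions = [
    df['Name'].str.contains('abra', na=False),
    df['Name'].str.contains('charmender', na=False),
]
choices = ['kadabra', 'charmeleon']

# the first matching condition wins, like CASE ... WHEN; default is the ELSE
df['Name'] = np.select(conditions, choices, default='Unknown')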
I have a textfile that contains 2 columns of data. They are separated by a varying number of whitespaces. I want to load it into a pandas DataFrame.
Example:
306.000000 1.125783
307.000000 0.008101
308.000000 -0.005917
309.000000 0.003784
310.000000 -0.516513
Please note that each line also starts with whitespace.
My desired output would be like:
output = {'Wavelength': [306.000000, 307.000000, 308.000000, 309.000000, 310.000000],
          'Reflectance': [1.125783, 0.008101, -0.005917, 0.003784, -0.516513]}
df = pd.DataFrame(data=output)
Use read_csv:
df = pd.read_csv('file.txt', sep=r'\s+', names=['Wavelength', 'Reflectance'], header=None)
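A quick self-contained check, using io.StringIO in place of file.txt (the \s+ separator also absorbs the leading whitespace on each line):

import io
import pandas as pd

data = """ 306.000000  1.125783
 307.000000  0.008101
 308.000000 -0.005917"""

df = pd.read_csv(io.StringIO(data), sep=r'\s+',
                 names=['Wavelength', 'Reflectance'], header=None)
print(df)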
I need to read a standard CSV into a data frame, do some manipulations, and stringify the data frame into a specialized pipe separated format (text file). In order to comply with the file format, I have to add double quotes to the entire string in that cell (if it contains a pipe) before writing the final string to a file.
I wanted to leverage Pandas functions to accomplish this. I tried dabbling with the contains and format functions, but have not been successful.
Does anyone know of a simple way to accomplish this leveraging Pandas?
Expected Input:
colA,colB,colC,colD
cat,waverly way,foo,10.0
dog,smokey | st,foo,9.7
cow,rapid ave,foo,6.6
rabbit,far | blvd,foo,3.2
Expected Output:
cat|waverly way|foo|10.0/
dog|"smokey|st"|foo|9.7/
cow|rapid ave|foo|6.6/
rabbit|"far|blvd"|foo|3.2/
The "/" is intentional
You can use np.where and manipulate the matching strings as below:
import numpy as np

df['colB'] = np.where(df['colB'].str.contains(r'\|'), '"' + df['colB'] + '"', df['colB'])
Note: since only colB has the pipe (|) character, the code above checks and manipulates only that column. If the pipe (|) character is expected in other columns as well, you may have to repeat the code for those columns.
For colD you have to convert it into a string (if it is not already a string) and add a forward slash, as below:
df['colD'] = df['colD'].astype(str) + '/'
Output
colA colB colC colD
0 cat waverly way foo 10.0/
1 dog "smokey | st" foo 9.7/
2 cow rapid ave foo 6.6/
3 rabbit "far | blvd" foo 3.2/
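To go from this frame to the final text file without to_csv re-quoting the manually wrapped values, one option is to join and write the rows by hand; a sketch, with the output file name assumed:

# each value already carries its quotes and colD its trailing "/",
# so join the columns with "|" and write one line per row
with open("final.txt", "w") as f:
    for row in df.astype(str).itertuples(index=False):
        f.write("|".join(row) + "\n")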
import pandas as pd
import csv

test = pd.read_csv("test.csv")
# line_terminator was renamed to lineterminator in pandas 1.5
test.to_csv("final.csv", sep="|", quoting=csv.QUOTE_NONNUMERIC,
            lineterminator="/\n", header=False, index=False)
Here is the contents of "final.csv":
"cat"|"waverly way"|"foo"|10.0/
"dog"|"smokey | st"|"foo"|9.7/
"cow"|"rapid ave"|"foo"|6.6/
"rabbit"|"far | blvd"|"foo"|3.2/
Edit: this will add quotes to all non-numeric values. If you want quotes only on the values with pipes, you can drop the quoting parameter entirely: the csv module's default QUOTE_MINIMAL already quotes any field that contains the separator, whereas pre-wrapping the values yourself (as in moy's solution) would just get re-quoted when to_csv writes them out:
import pandas as pd

df = pd.read_csv("test.csv")
# QUOTE_MINIMAL (the default) quotes exactly the fields that contain
# the "|" separator, e.g. smokey | st -> "smokey | st"
df.to_csv("final.csv", sep="|", lineterminator="/\n", header=False, index=False)