How to replace information in a DataFrame using Pandas and Python?

I need some help:
How can I update a column of the file_csv_reference.csv DataFrame using Pandas and Python?
file_csv_reference.csv:
cod_example
123456
123456
123456
789101
789101
121314
121314
There are lines with repeated information; I would like to replace all of them with the respective updated code from the file below:
file_with_updated_cod.csv
old_cod;updated_cod
123456;1234567
789101;7891011
121314;1213141
So far I have been thinking along these lines (but I can't get it to run right):
import pandas as pd

file01 = pd.read_csv("file_csv_reference.csv", encoding="utf-8", delimiter=";", header=0)
file02 = pd.read_csv("file_with_updated_cod.csv", encoding="utf-8", delimiter=";", header=0)

for oldcod in file01['cod_example']:
    for cod in file02['old_cod']:
        if oldcod == cod:
            # in this part I would like to replace the data in the file01 column
            # cod_example with the respective file02['updated_cod'] value
Could you please help me solve this? Thanks!

You can use .map:
df1 = pd.read_csv("file_csv_reference.csv")
df2 = pd.read_csv("file_with_updated_cod.csv", sep=";")
df1["cod_example"] = df1["cod_example"].map(
    df2.set_index("old_cod")["updated_cod"]
)
print(df1)
Prints:
   cod_example
0      1234567
1      1234567
2      1234567
3      7891011
4      7891011
5      1213141
6      1213141
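Note that .map returns NaN for any code that is missing from the lookup table. A minimal sketch, assuming you want to keep the original value in that case:

mapping = df2.set_index("old_cod")["updated_cod"]
# Codes absent from the lookup would become NaN; fall back to the original value
df1["cod_example"] = df1["cod_example"].map(mapping).fillna(df1["cod_example"])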

Related

Python add column names and split rows to columns

I need help with coding.
I wrote code to get the last 2 rows from a csv file and save them to another file.
The code looks like this:
import csv
import pandas as pd

with open(outputFileName, "r") as f:
    reader = csv.reader(f, delimiter=",")
    data = list(reader)
    row_count = len(data)

df = pd.read_csv(outputFileName, skiprows=row_count - 2)
df.to_csv(r'D:\koreguoti.csv', index=False)
Data in the file now looks like this (but without the names Column1 and Column2; I just want to show you that the information is in different columns):
Column1 | Column2
2021.03.17 12:00:00 P+ 0 | 644.0
0 2021.03.17 12:00:00 P- 0 | 6735.0
So I need to have it in this format (with names of columns):
Date | Time | P | Value
0 2021.03.17 | 12:00:00 | P+| 644.0
1 2021.03.17 | 12:00:00 | P-| 6735.0
Could anybody help me?
I'd do a text = csv_file.split() and keep only what I want, then reorganise the pieces.
For example, if you have:
Column 1 Column 2
12 15
you do text = csv_file.split(), which gives you text = ["Column", "1", "Column", "2", "12", "15"],
and you just take the last two and do a
print("Month : Day :\n\
{} {}".format(text[4], text[5]))
and that's it.
Of course you need to change some things until it works for you.
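A pandas-based sketch of the same idea, using Series.str.split with expand=True (the file path and the two-column layout are assumptions based on the question):

import pandas as pd

# Hypothetical: the two saved rows, read back without a header
df = pd.read_csv(r'D:\koreguoti.csv', header=None, names=['raw', 'Value'])

# Split the combined first field on whitespace into its parts
parts = df['raw'].str.split(expand=True)
out = pd.DataFrame({
    'Date': parts[0],     # e.g. 2021.03.17
    'Time': parts[1],     # e.g. 12:00:00
    'P': parts[2],        # e.g. P+ or P-
    'Value': df['Value'],
})
print(out)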
Solved, by working around:
df['0'] = ['no']
df['1'] = ['no']
df['2'] = ['no']
df.to_csv(r'D:\koreguoti1.csv', index=False)
# ---------------------------------------------------------------------------
# Rename column names
df = pd.read_csv(r'D:\koreguoti1.csv', header=None)
df.rename(columns={0: 'Data', 1: 'Laikas', 2: 'P', 3: 'Nulis', 4: 'Verte'}, inplace=True)
# Copy values from one column to another
df['Verte'] = df['Laikas']
# Split the first column into 4 columns
split_data = df["Data"].str.split(" ")
data = split_data.to_list()
names = ["Data", "Laikas", "P", "Nulis"]
new_df = pd.DataFrame(data, columns=names)
new_df.insert(4, "Verte", 0)
# adding the needed column
new_df['Verte'] = df['Laikas']
# Deleting the not-needed column "Nulis"
del new_df['Nulis']
# print(new_df)
# Save everything to a new file
new_df.to_csv(r'D:\sutvarkyti.csv', index=False)

Modify column data before a specific character using Regex in pandas

I'm trying to modify the ADDRESS column data by removing everything up to and including the comma.
Sample data:
**ADDRESS**
0 Ksfc Layout,Bangalore
1 Vishweshwara Nagar,Mysore
2 Jigani,Bangalore
3 Sector-1 Vaishali,Ghaziabad
4 New Town,Kolkata
Expected Output:
**ADDRESS**
0 Bangalore
1 Mysore
2 Bangalore
3 Ghaziabad
4 Kolkata
I tried this code, but it's not working. Can someone correct it?
import pandas as pd
import regex as re
data = pd.read_csv("train.csv")
data.ADDRESS.replace(re.sub(r'.*,',"", data.ADDRESS), regex=True, inplace=True)
Try this:
data.ADDRESS = data.ADDRESS.str.split(',').str[-1]
You can do it without a regex:
def removeFirst(x):
    return x.split(",")[-1]

df['ADDRESS'] = df['ADDRESS'].apply(removeFirst)
You can try it like this, without regex:
data['ADDRESS'] = data['ADDRESS'].str.split(',').str[-1]
Use Series.str.replace (passing regex=True explicitly, since pandas 2.0 defaults to a literal replacement):
data['ADDRESS'] = data['ADDRESS'].str.replace(r'.*,', '', regex=True)
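For completeness: the original attempt fails because re.sub() receives the whole Series instead of a single string; the .str accessor is what applies a pattern element-wise. A minimal sketch of the difference:

import pandas as pd

s = pd.Series(['Ksfc Layout,Bangalore', 'Jigani,Bangalore'])
# re.sub(r'.*,', '', s)  # raises TypeError: expected string or bytes-like object
print(s.str.replace(r'.*,', '', regex=True).tolist())  # ['Bangalore', 'Bangalore']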

Specify length (padding) of each column in a csv

I am trying to re-arrange a file to match a BACS bank format. In order for it to work, the columns in the csv below need to be of a specific length. I have figured out the abcdabcd column, as it's a repeating pattern (as are a couple more in the file), but several columns have random numbers that I cannot easily target.
Is there a way for me to target either (ideally) a specific column based on its header, or alternatively everything up to a comma, to butcher something that could work?
In my example file below, you'll see three columns where the value changes. If targeting everything up to a specific character is the solution, I was thinking of using .ljust to fill the column up to the specified length (and then sorting it out manually in Excel).
Original File
a,b,c,d,e,f,g,h,i,j,k
12345,1234567,0,11,123456,12345678,1234567,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,12345678,abcdabcd,A ABCD
123456,1234567,0,11,123456,12345678,12345,abcdabcd,A ABCD
12345,1234567,0,11,123456,12345678,1234567,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,123456789,abcdabcd,A ABCD
Ideal output
a,b,c,d,e,f,g,h,i,j,k
123450,12345670,0,11,123456,12345678,123456700,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,123456780,abcdabcd,A ABCD
123456,12345670,0,11,123456,12345678,123450000,abcdabcd,A ABCD
123450,12345670,0,11,123456,12345678,123456700,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,123456789,abcdabcd,A ABCD
Code
with open('file.txt', 'r') as file:
    filedata = file.read()

filedata = filedata.replace('12345', '12345'.ljust(6, '0'))

with open('file.txt', 'w') as file:
    file.write(filedata)
EDIT:
Something similar to this Python - How to add zeros to and integer/string? but while either targeting a specific column per header, or at least the first one.
EDIT2:
I am using the below to rearrange my columns; could this be modified to work with string lengths?
import pandas as pd
## Read csv / tab-delimited in this example
df = pd.read_csv('test.txt', sep='\t')
## Reorder columns
df = df[['h','i','c','g','a','b','e','d','f','j','k']]
## Write csv / tab-delimited
df.to_csv('test', sep='\t')
Using pandas, you can convert the column to str and then use .str.pad. You can make a dict with the requested lengths:
lengths = {
    "a": 6,
    "b": 8,
    "c": 3,
    "d": 6,
    "e": 8,
}
and use it like this:
result = pd.DataFrame(
    {
        column_name: column.str.pad(
            lengths.get(column_name, 0), side="right", fillchar="0"
        )
        for column_name, column in df.astype(str).items()
    }
)
If the fillchar is different per column, you can get that from a dict as well.
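For instance, a minimal sketch with a hypothetical fillchars dict, defaulting to "0":

fillchars = {"a": "0", "b": " "}  # hypothetical per-column fill characters

result = pd.DataFrame(
    {
        column_name: column.str.pad(
            lengths.get(column_name, 0),
            side="right",
            fillchar=fillchars.get(column_name, "0"),
        )
        for column_name, column in df.astype(str).items()
    }
)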
>>> '{:0>5}'.format(4)
'00004'
>>> '{:0<5}'.format(4)
'40000'
>>> '{:0^5}'.format(4)
'00400'
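The same alignment specifiers work in f-strings, for example:

n = 4
print(f'{n:0>5}')  # 00004
print(f'{n:0<5}')  # 40000
print(f'{n:0^5}')  # 00400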
Example:
#--------------DEFs------------------
def number_zero_right(number, len_number):
    return ('{:0<' + str(len_number) + '}').format(number)
#--------------MAIN------------------
a = 12345
b = 1234567
c = 0
d = 11
e = 123456
f = 12345678
g = 1234567
h = 'abcdabcd'
i = 'A'
j = 'ABCD'
print(a,b,c,d,e,f,g,h,i,j)
# > 12345 1234567 0 11 123456 12345678 1234567 abcdabcd A ABCD
a = number_zero_right(a,6)
b = number_zero_right(b,8)
c = number_zero_right(c,1)
d = number_zero_right(d,2)
e = number_zero_right(e,6)
f = number_zero_right(f,8)
g = number_zero_right(g,9)
print(a,b,c,d,e,f,g,h,i,j)
#> 123450 12345670 0 11 123456 12345678 123456700 abcdabcd A ABCD
Managed to get there, so thought I'd post in case someone has a similar issue. This only works on one column, but that's enough for me now.
# import pandas
import pandas as pd

# open file and convert data to str
data = pd.read_csv('Test.CSV', dtype=str)

# width of output string
width = 6
# fillchar
char = "_"

# Change the contents of column named ColumnID
data["ColumnID"] = data["ColumnID"].str.ljust(width, char)

# print output
print(data)
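To pad several columns by header in one pass, the same idea extends with a dict of target widths (the widths and output file name below are assumptions):

import pandas as pd

lengths = {"a": 6, "b": 8, "g": 9}  # hypothetical target widths per header

data = pd.read_csv('Test.CSV', dtype=str)
for col, width in lengths.items():
    data[col] = data[col].str.ljust(width, '0')

data.to_csv('Test_padded.csv', index=False)  # hypothetical output name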

Python Pandas: Dataframe is not updating using string methods

I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column named 'About', which holds the rows of data I want to manipulate.
I've already used the .str methods to update the values, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations together, because they work on the same About column; and because the values are converted to lowercase first, the regex can be changed to match non-lowercase characters (pass regex=True explicitly, since pandas 2.0 defaults to a literal replacement):
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print (df)
    About
0    aasd
1  sdd aa
import pandas as pd

columns = ['About']
data = ["ALPHA", "OMEGA", "ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga
Example Dataframe:
>>> df
        About
0      JOHN23
1     PINKO22
2   MERRY jen
3  Soojan San
4      Remo55
Solution: another way, using a compiled regex with flags:
>>> import re
>>> regex_pat = re.compile(r'[^a-z]+$', flags=re.IGNORECASE)
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0          john
1         pinko
2     merry jen
3    soojan san
4          remo
Name: About, dtype: object
Explanation:
[^a-z] matches a single character not present in the range a to z (case sensitive without the IGNORECASE flag)
+ quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string

df.query while using pandas in python produces empty results

I am learning how to manipulate data using pandas in python. I got the following script:
import pandas as pd
df = pd.read_table( "t.txt" ) #read in the file
df.columns = [x.strip() for x in df.columns] #strip spaces in headers
df = df.query('TLD == ".biz"') #select the rows where TLD == ".biz"
df.to_csv('t.txt', sep='\t') #write the output to a tab-separated file
but the output file has no records, only headers. When I check using
print(df)
prior to the selection, the output is:
TLD Length Words \
0 .biz 5 ...
1 .biz 4 ...
2 .biz 5 ...
3 .biz 5 ...
4 .biz 3 ...
5 .biz 3 ...
6 .biz 6 ...
So I know the column TLD has rows with the .biz value. I also tried:
>>> print(df.loc[df['TLD'] == '.biz'])
but the result is
Empty DataFrame
with just the list of my columns.
What am I doing wrong, please?
It seems there is some whitespace in the values, so you need to remove it with strip:
print(df.loc[df['TLD'].str.strip() == '.biz'])
df['TLD'] = df['TLD'].str.strip()
df = df.query('TLD == ".biz"')
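A fuller sketch that strips whitespace from every string column right after reading, so both .query and .loc comparisons behave (file name taken from the question; the rest is standard pandas):

import pandas as pd

df = pd.read_table("t.txt")
df.columns = [c.strip() for c in df.columns]  # strip spaces in headers

# Strip surrounding whitespace in every string-typed column, not just TLD
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()

df = df.query('TLD == ".biz"')
df.to_csv("t.txt", sep="\t", index=False)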
