How to transform pandas column values to dict with column name - python

I need help on how to properly transform my df from this:
df = pd.DataFrame({'ID': ['ID no1', 'ID no2', 'ID no3'],
                   'ValueM2': ['11998', '11076', '12025'],
                   'ValueSqFt': [129145.39718, 119221.07178, 129.43600276]})
to this (I also need the output to use double quotes (") instead of single quotes (')):
dfnew = pd.DataFrame({'ID': ['ID no1', 'ID no2', 'ID no3'],
                      'DataMetric': [{"ValueM2": "11998"}, {"ValueM2": "11076"}, {"ValueM2": "12025"}],
                      'DataImperial': [{"ValueSqFt": "129145.39718"}, {"ValueSqFt": "119221.07178"}, {"ValueSqFt": "129.43600276"}]})

If only two columns need to be manipulated, it is simplest to adopt a manual approach as follows:
df['ValueM2'] = [{'ValueM2': x} for x in df['ValueM2'].values]
df['ValueSqFt'] = [{"ValueSqFt": x} for x in df['ValueSqFt'].values]
df = df.rename(columns={'ValueM2': 'DataMetric', 'ValueSqFt': 'DataImperial'})
If you want to have the output with double quotes, you can use json.dumps:
import json
df['DataMetric'] = df['DataMetric'].apply(json.dumps)
df['DataImperial'] = df['DataImperial'].apply(json.dumps)
or
df['DataMetric'] = df['DataMetric'].astype(str).apply(lambda x: x.replace("'", '"'))
df['DataImperial'] = df['DataImperial'].astype(str).apply(lambda x: x.replace("'", '"'))
but this will convert the data type to string!
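If more columns need the same treatment, the rename-and-wrap steps generalize to a loop; a minimal sketch over the sample df (the mapping dict is the only assumption, and values are cast to str to match the desired double-quoted output):

```python
import json

import pandas as pd

df = pd.DataFrame({'ID': ['ID no1', 'ID no2', 'ID no3'],
                   'ValueM2': ['11998', '11076', '12025'],
                   'ValueSqFt': [129145.39718, 119221.07178, 129.43600276]})

# map each source column to its new name, wrap each value in a dict,
# then serialize with json.dumps so the result uses double quotes
mapping = {'ValueM2': 'DataMetric', 'ValueSqFt': 'DataImperial'}
for old, new in mapping.items():
    df[new] = [json.dumps({old: str(x)}) for x in df[old]]
df = df.drop(columns=list(mapping))

print(df.loc[0, 'DataMetric'])   # {"ValueM2": "11998"}
```

Because json.dumps is applied at wrap time, there is no intermediate dict column to convert afterwards.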

Related

Removing a comma at the end of each row in python

I have the below dataframe
After doing the manipulations below, the Rule column ends up with a comma at the end, which is expected, but I want to remove it. How can I do that?
df['Rule'] = df.State.apply(lambda x: str("'"+str(x)+"',"))
df['Rule'] = df.groupby(['Description'])['Rule'].transform(lambda x: ' '.join(x))
df1 = df.drop_duplicates('Description',keep = 'first')
df1['Rule'] = df1['Rule'].apply(lambda x: str("("+str(x)+")"))
I have tried using iloc[-1].replace(",", ""), but it is not working.
Try this:
df['Rule'] = df.State.apply(lambda x: str("'"+str(x)+"'"))
df['Rule'] = df.groupby(['Description'])['Rule'].transform(lambda x: ', '.join(x))
df1 = df.drop_duplicates('Description', keep = 'first')
df1['Rule'] = df1['Rule'].apply(lambda x: str("("+str(x)+")"))
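The question's dataframe isn't shown, so here is the corrected pipeline on a small made-up frame (the Description and State values are hypothetical):

```python
import pandas as pd

# hypothetical stand-in for the question's dataframe
df = pd.DataFrame({'Description': ['d1', 'd1', 'd2'],
                   'State': ['NY', 'CA', 'TX']})

# quote each state, join within each Description with ', ' (the comma
# goes *between* items, so nothing trails), then wrap in parentheses
df['Rule'] = df.State.apply(lambda x: "'" + str(x) + "'")
df['Rule'] = df.groupby('Description')['Rule'].transform(lambda x: ', '.join(x))
df1 = df.drop_duplicates('Description', keep='first').copy()
df1['Rule'] = df1['Rule'].apply(lambda x: '(' + x + ')')

print(df1['Rule'].tolist())   # ["('NY', 'CA')", "('TX')"]
```

Moving the comma into the join separator is what removes the trailing comma, since a separator is never appended after the last item.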

dataframe if string contains substring divide by 1000

I have a dataframe that looks like the one below.
import pandas as pd
data = [['yellow', '800test' ], ['red','900ui'], ['blue','900test'], ['indigo','700test'], ['black','400ui']]
df = pd.DataFrame(data, columns = ['Donor', 'value'])
In the value field, if a string contains say 'test', I'd like to divide these numbers by 1000. What would be the best way to do this?
Check the below code using a lambda function:
df['value_2'] = df.apply(lambda x: str(int(x.value.replace('test',''))/1000)+'test' if x.value.find('test') > -1 else x.value, axis=1)
df
Output: (screenshot omitted)

Another option is to split the number and text parts into helper columns:
df[r"value\d"] = df.value.str.findall(r"\d+").str[0].astype(float)
df[r"value\w"] = df.value.str.findall(r"\D+").str[0]
df.loc[df[r"value\w"] == "test", r"value\d"] /= 1000
df["value"] = df[r"value\d"].astype(str) + df[r"value\w"]
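A vectorized alternative sketch using str.extract instead of findall (the helper names num and txt are just illustrative); note the float cast also reformats the untouched rows, which the lambda answer above avoids:

```python
import pandas as pd

data = [['yellow', '800test'], ['red', '900ui'], ['blue', '900test'],
        ['indigo', '700test'], ['black', '400ui']]
df = pd.DataFrame(data, columns=['Donor', 'value'])

# split each value into its numeric and text parts
num = df['value'].str.extract(r'(\d+)')[0].astype(float)
txt = df['value'].str.extract(r'(\D+)')[0]

# divide only where the text part is 'test', then recombine
num = num.where(txt != 'test', num / 1000)
df['value'] = num.astype(str) + txt

print(df['value'].tolist())
# ['0.8test', '900.0ui', '0.9test', '0.7test', '400.0ui']
```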

Pandas reorder row content

I have the following Excel file (screenshot omitted), which I've converted to a DataFrame and dropped 2 columns from using the code below:
df = pd.read_excel(self.file)
df.drop(['Name', 'Scopus ID'], axis=1, inplace=True)
Now, my goal is to switch the order of all names within the df.
For example,
the first name is Adedokun, Babatunde Olubayo
which I would like to convert to Babatunde Olubayo Adedokun.
How can I do that for the entire df, whatever the name is?
Split the name and concatenate the parts in reverse order:
import pandas as pd
data = {'Name': ['Adedokun, Babatunde Olubayo', "Uwizeye, Dieudonné"]}
df = pd.DataFrame(data)
def swap_name(name):
    name = name.split(', ')
    return name[1] + ' ' + name[0]
df['Name'] = df['Name'].apply(swap_name)
df
Output:
> Name
> 0 Babatunde Olubayo Adedokun
> 1 Dieudonné Uwizeye
Let's assume you want to do the operation on "Other Names 1":
df.loc[:, "Other Names1"] = df["Other Names1"].str.split(", ").apply(lambda row: " ".join(row[::-1]))
You can use str accessor:
df['Name'] = df['Name'].str.split(', ').str[::-1].str.join(' ')
print(df)
# Output
Name
0 Babatunde Olubayo Adedokun
1 Dieudonné Uwizeye
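If some names might not contain a comma at all, swap_name would raise an IndexError; a guarded sketch (the single-word name here is a hypothetical example):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Adedokun, Babatunde Olubayo', 'Madonna']})

def swap_name(name):
    # only reorder when there is exactly one ', ' to split on
    parts = name.split(', ')
    if len(parts) == 2:
        return parts[1] + ' ' + parts[0]
    return name

df['Name'] = df['Name'].apply(swap_name)
print(df['Name'].tolist())   # ['Babatunde Olubayo Adedokun', 'Madonna']
```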

Converting DataFrame containing UTF-8 and Nulls to String Without Losing Data

Here's my code for reading in this dataframe:
html = 'https://www.agroindustria.gob.ar/sitio/areas/ss_mercados_agropecuarios/logistica/_archivos/000023_Posici%C3%B3n%20de%20Camiones%20y%20Vagones/000010_Entrada%20de%20camiones%20y%20vagones%20a%20puertos%20semanal%20y%20mensual.php'
url = urlopen(html)
df = pd.read_html(html, encoding = 'utf-8')
remove = []
for x in range(len(df)):
    if len(df[x]) < 10:
        remove.append(x)
for x in remove[::-1]:
    df.pop(x)
df = df[0]
The dataframe uses both ',' and '.' as thousands separators, and I want neither, so 5.103 should become 5103.
Using this code:
df = df.apply(lambda x: x.str.replace('.', ''))
df = df.apply(lambda x: x.str.replace(',', ''))
All of the data will get changed, but the values in the last four columns will all turn to NaN. I'm assuming this has something to do with trying to use str.replace on a float?
Trying any sort of df[column] = df[column].astype(str) also gives back errors, as does something convoluted like the following:
for x in df.columns.tolist():
    for k, v in df[x].iteritems():
        if pd.isnull(v) == False and type(v) = float:
            df.loc(k, df[x]) == str(v)
What is the right way to approach this problem?
You can try this regex approach. I haven't tested it, but it should work (applied element-wise, not to the whole column at once):
import re
df = df.apply(lambda col: col.map(lambda x: re.sub(r'(\d+)[.,](\d+)', r'\1\2', str(x))))
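Since the regex approach casts every cell to str (NaN becomes 'nan'), here is an alternative sketch that strips separators only from string cells, so floats and NaN pass through untouched (the small frame is a stand-in for the real read_html result):

```python
import pandas as pd

# small stand-in frame; the real one comes from pd.read_html
df = pd.DataFrame({'a': ['5.103', '1,024'], 'b': [3.5, None]})

def strip_seps(x):
    # only touch actual strings; floats and NaN pass through unchanged
    if isinstance(x, str):
        return x.replace('.', '').replace(',', '')
    return x

df = df.apply(lambda col: col.map(strip_seps))
print(df['a'].tolist())   # ['5103', '1024']
```

Guarding on isinstance avoids the "str.replace on a float" problem the question describes, without converting numeric columns to strings.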

Formatting a specific row of integers to the ssn style

I want to format a specific column of integers to ssn format (xxx-xx-xxxx). I saw that openpyxl has builtin styles. I have been using pandas and wasn't sure if it could do this specific format.
I did see this -
df.iloc[:,:].str.replace(',', '')
but I want to replace the ',' with '-'.
import pandas as pd
df = pd.read_excel('C:/Python/Python37/Files/Original.xls')
df.drop(['StartDate', 'EndDate','EmployeeID'], axis = 1, inplace=True)
df.rename(columns={'CheckNumber': 'W/E Date', 'CheckBranch': 'Branch','DeductionAmount':'Amount'},inplace=True)
df = df[['Branch','Deduction','CheckDate','W/E Date','SSN','LastName','FirstName','Amount','Agency','CaseNumber']]
ssn = (df['SSN']        # the integer column
       .astype(str)     # cast integers to string
       .str.zfill(8)    # zero-padding
       .pipe(lambda s: s.str[:2] + '-' + s.str[2:4] + '-' + s.str[4:]))
writer = pd.ExcelWriter('C:/Python/Python37/Files/Deductions Report.xlsx')
df.to_excel(writer,'Sheet1')
writer.save()
Your question is a bit confusing, see if this helps:
If you have a column of integers and want to create a new one made up of strings in SSN (Social Security Number) format, you can try something like:
df['SSN'] = (df['SSN']      # the "integer" column
             .astype(int)   # ensure integer dtype
             .astype(str)   # cast integers to string
             .str.zfill(9)  # zero-pad to nine digits
             .pipe(lambda s: s.str[:3] + '-' + s.str[3:5] + '-' + s.str[5:]))
Setup
Social Security numbers are nine-digit numbers using the form: AAA-GG-SSSS
s = pd.Series([111223333, 222334444])
0 111223333
1 222334444
dtype: int64
Option 1
Using zip and numpy.unravel_index (requires import numpy as np):
pd.Series([
    '{}-{}-{}'.format(*el)
    for el in zip(*np.unravel_index(s, (1000, 100, 10000)))
])
Option 2
Using f-strings:
pd.Series([f'{i[:3]}-{i[3:5]}-{i[5:]}' for i in s.astype(str)])
Both produce:
0 111-22-3333
1 222-33-4444
dtype: object
I prefer:
df["ssn"] = df["ssn"].astype(str)
df["ssn"] = df["ssn"].str.strip()
df["ssn"] = (
df.ssn.str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace("-", "", regex=False)
    .str.replace(" ", "", regex=False)
.apply(lambda x: f"{x[:3]}-{x[3:5]}-{x[5:]}")
)
This takes into account rows that are partially formatted, fully formatted, or not formatted at all, and correctly formats them all.
For Example:
data = [111111111,123456789,"222-11-3333","433-3131234"]
df = pd.DataFrame(data, columns=['ssn'])
Gives you:
(before/after screenshots omitted)
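As a quick check, running the preferred snippet on the mixed-format sample produces consistently dash-formatted values (the regex=False flags are added here for newer pandas, where "(" would otherwise be parsed as a regex):

```python
import pandas as pd

data = [111111111, 123456789, '222-11-3333', '433-3131234']
df = pd.DataFrame(data, columns=['ssn'])

# normalize everything to a bare digit string, then re-insert dashes
df['ssn'] = df['ssn'].astype(str).str.strip()
df['ssn'] = (
    df.ssn.str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(' ', '', regex=False)
    .apply(lambda x: f'{x[:3]}-{x[3:5]}-{x[5:]}')
)
print(df['ssn'].tolist())
# ['111-11-1111', '123-45-6789', '222-11-3333', '433-31-31234']
```

Note that the last value has ten digits, so its final group comes out longer; the slicing approach does not validate length.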
