Data Preprocessing in Python using Pandas

I am trying to preprocess one of the columns in my DataFrame. The issue is that the relations column contains values like [[content1], [content2], [content3]] and I want to remove the brackets.
I have tried the following:
df['value'] = df['value'].str[0]
and the output that I get is
[content1]
print(df)
id value
1 [[str1],[str2],[str3]]
2 [[str4],[str5]]
3 [[str1]]
4 [[str8]]
5 [[str9]]
6 [[str4]]
The expected output should look like:
id value
1 str1,str2,str3
2 str4,str5
3 str1
4 str8
5 str9
6 str4

It looks like you have lists of lists. You can try to unnest and join:
df['value'] = df['value'].apply(lambda x: ','.join([e for l in x for e in l]))
Or:
from itertools import chain
df['value'] = df['value'].apply(lambda x: ','.join(chain.from_iterable(x)))
NB: if you get an error, please provide it along with the type of the column (df.dtypes).
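As a minimal check, here is a self-contained sketch of the unnest-and-join, assuming the column really holds Python lists of lists (df and value are the names from the question):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'value': [[['str1'], ['str2'], ['str3']], [['str4'], ['str5']]],
})
# flatten each list of lists, then join the pieces with commas
df['value'] = df['value'].apply(lambda x: ','.join(e for l in x for e in l))
print(df)
#    id           value
# 0   1  str1,str2,str3
# 1   2       str4,str5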

From what I can see, your data is stored as strings; sampling it the same way:
Sample Data:
df = pd.DataFrame({'id':[1,2,3,4,5,6], 'value':['[[str1],[str2],[str3]]', '[[str4],[str5]]', '[[str1]]', '[[str8]]', '[[str9]]', '[[str4]]']})
print(df)
id value
0 1 [[str1],[str2],[str3]]
1 2 [[str4],[str5]]
2 3 [[str1]]
3 4 [[str8]]
4 5 [[str9]]
5 6 [[str4]]
Result:
df['value'] = df['value'].astype(str).str.replace('[', '', regex=False).str.replace(']', '', regex=False)
print(df)
id value
0 1 str1,str2,str3
1 2 str4,str5
2 3 str1
3 4 str8
4 5 str9
5 6 str4
Note: the error AttributeError: Can only use .str accessor with string values means the column is not being treated as str, hence the astype(str) cast before the replace operations. Also, [ and ] are regex metacharacters, so pass regex=False (or escape them) to replace them literally.
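A slightly more compact variant of the same idea, assuming the values are strings: strip both brackets in a single regex replace.
df['value'] = df['value'].astype(str).str.replace(r'[\[\]]', '', regex=True)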

You can use Python's built-in regex module, re. This is one solution:
import pandas as pd
import re
# make the test data
data = [
[1, '[[str1],[str2],[str3]]'],
[2, '[[str4],[str5]]'],
[3, '[[str1]]'],
[4, '[[str8]]'],
[5, '[[str9]]'],
[6, '[[str4]]']
]
# convert the data to a DataFrame
df = pd.DataFrame(data, columns = ['id', 'value'])
print(df)
# remove '[' and ']' from the 'value' column
df['value'] = df.apply(lambda x: re.sub(r"[\[\]]", "", x['value']), axis=1)
print(df)
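Since only one column is involved, applying re.sub to the Series directly is a touch simpler than a row-wise df.apply and behaves the same here (re is already imported above):
df['value'] = df['value'].apply(lambda v: re.sub(r"[\[\]]", "", v))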

Related

split the string in dataframe in python

I have a DataFrame and one of its columns contains strings separated by a dash. I want to get the part before the dash. Could you help me with that?
import pandas as pd
df = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 5]
df['b'] = ['C-C02','R-C05','R-C01','C-C06', 'RC-C06']
The desired output is the part before the dash (C, R, R, C, RC).
You could use str.replace to remove the - and all characters after it:
df['b'] = df['b'].str.replace(r'-.*$', '', regex=True)
Output:
a b
0 1 C
1 2 R
2 3 R
3 4 C
4 5 RC
You want to split each string on the '-' character and keep the part before it:
df['c'] = [s.split('-')[0] for s in df['b']]
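A vectorized spelling of the same idea, using pandas' string methods instead of a Python-level loop:
df['c'] = df['b'].str.split('-', n=1).str[0]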

How to write in excel/pandas Dataframe of two different variables value in same column

I'm new to pandas and trying to figure out how to put two different variables' values in the same column.
import pandas as pd
import requests
from bs4 import BeautifulSoup
itemproducts = pd.DataFrame()
url = 'https://www.trwaftermarket.com/en/catalogue/product/BCH720/'
r = requests.get(url)
soup = BeautifulSoup(r.content,'lxml')
code_name = soup.find_all('div', {'class': 'col-sm-6 intro-section reset-margin'})
for head in code_name:
    item_code = head.find('span', {'class': 'heading'}).text
    item_name = head.find('span', {'class': 'subheading'}).text

# tab_4 is not defined in the snippet as posted
for tab_ in tab_4:
    ab = tab_.find_all('td')
    make_name1 = ab[0].text.replace('Make', '')
    code1 = ab[1].text.replace('OE Number', '')
    make_name2 = ab[2].text.replace('Make', '')
    code2 = ab[3].text.replace('OE Number', '')
    # note: DataFrame.append was removed in pandas 2.0; pd.concat is the modern spelling
    itemproducts = itemproducts.append({'CODE': item_code,
                                        'NAME': item_name,
                                        'MAKE': [make_name1, make_name2],
                                        'OE NUMBER': [code1, code2]}, ignore_index=True)
Output (Excel screenshot in the original post)
What I actually want (screenshot in the original post)
In pandas, every column must have the same length. So, in this case, I suggest you specify each column or row as a fixed-length list and, for those that come up one member short, append a NaN to match.
I found a similar question here on Stack Overflow that can help you. Another approach is to use the explode method of the pandas DataFrame.
Below I put an example from pandas documentation.
>>> df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
>>> df
A B
0 [1, 2, 3] 1
1 foo 1
2 [] 1
3 [3, 4] 1
>>> df.explode('A')
A B
0 1 1
0 2 1
0 3 1
1 foo 1
2 NaN 1
3 3 1
3 4 1
I couldn't reproduce the results from your script. However, based on your final dataframe, perhaps you can make use of explode together with apply at the end:
#creating your dataframe
itemproducts = pd.DataFrame({'CODE':'BCH720','MAKE':[['HONDA','HONDA']],'NAME':['Brake Caliper'],'OE NUMBER':[['43019-SAA-J51','43019-SAA-J50']]})
>>> itemproducts
CODE MAKE NAME OE NUMBER
0 BCH720 ['HONDA', 'HONDA'] Brake Caliper ['43019-SAA-J51', '43019-SAA-J50']
#using apply method with explode on 'MAKE' and 'OE NUMBER'
>>> itemproducts.apply(lambda x: x.explode() if x.name in ['MAKE', 'OE NUMBER'] else x)
CODE MAKE NAME OE NUMBER
0 BCH720 HONDA Brake Caliper 43019-SAA-J51
0 BCH720 HONDA Brake Caliper 43019-SAA-J50
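One caveat worth adding (mine, not part of the original answer): explode keeps the original index, which is why both rows above are labelled 0. If distinct labels matter downstream, reset the index afterwards:
itemproducts = itemproducts.apply(lambda x: x.explode() if x.name in ['MAKE', 'OE NUMBER'] else x)
itemproducts = itemproducts.reset_index(drop=True)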

Element-wise Comparison of Two Pandas Dataframes

I am trying to compare two columns in pandas. I know I can do:
# either using Pandas' equals()
df1[col].equals(df2[col])
# or this
df1[col] == df2[col]
However, what I am looking for is to compare these columns element-wise and, where they do not match, print out both values. I have tried:
if df1[col] != df2[col]:
    print(df1[col])
    print(df2[col])
where I get the error 'The truth value of a Series is ambiguous'.
I believe this is because the comparison produces a Series of booleans, which causes the ambiguity. I have also tried various forms of for loops, which did not resolve the issue.
Can anyone point me to how I should go about doing what I described?
This might work for you:
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [1, 2, 9, 4, 7]})
if not df2[df2['col1'] != df1['col1']].empty:
    print(df1[df1['col1'] != df2['col1']])
    print(df2[df2['col1'] != df1['col1']])
Output:
col1
2 3
4 5
col1
2 9
4 7
You need to get hold of the index where the column values are not matching. Once you have that index then you can query the individual DFs to get the values.
Please try the following and see if this helps:
for ind in df1.loc[df1['col1'] != df2['col1']].index:
    x = df1.loc[df1.index == ind, 'col1'].values[0]
    y = df2.loc[df2.index == ind, 'col1'].values[0]
    print(x, y)
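A variant of the same idea (assuming both frames share an index) that builds the boolean mask once instead of re-querying by index for every row:
mask = df1['col1'] != df2['col1']
for x, y in zip(df1.loc[mask, 'col1'], df2.loc[mask, 'col1']):
    print(x, y)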
Solution
Try this. You could use any of the following one-line solutions.
# Option-1
df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
# Option-2
df.loc[df[col1]!=df[col2], [col1, col2]]
Logic:
Option-1: We use pandas.DataFrame.apply() to evaluate the target columns row by row and pass the resulting boolean mask to df.loc[mask, [col1, col2]], which returns the required set of rows where col1 != col2.
Option-2: We build the mask directly with df[col1] != df[col2]; the rest of the logic is the same as Option-1. Being vectorized, Option-2 will generally be faster.
Dummy Data
I made the dummy data such that for indices 2, 6, and 8 columns 'a' and 'c' differ. Thus, we want only those rows returned by the solution.
import numpy as np
import pandas as pd
a = np.arange(10)
c = a.copy()
c[[2,6,8]] = [0,20,40]
df = pd.DataFrame({'a': a, 'b': a**2, 'c': c})
print(df)
Output:
a b c
0 0 0 0
1 1 1 1
2 2 4 0
3 3 9 3
4 4 16 4
5 5 25 5
6 6 36 20
7 7 49 7
8 8 64 40
9 9 81 9
Applying the solution to the dummy data
We see that the solution proposed returns the result as expected.
col1, col2 = 'a', 'c'
result = df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
print(result)
Output:
a c
2 2 0
6 6 20
8 8 40
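As a side note (not part of the original answers): on pandas 1.1 or later, DataFrame.compare performs this element-wise diff between two frames directly. Given the df1 and df2 from the first answer above:
result = df1.compare(df2)
print(result)
# returns a 'self' column (df1's value) and an 'other' column (df2's value)
# only for the positions where the two frames disagree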

Loading Json into Pandas dataframe

I have a valid json file with the following format that I am trying to load into pandas.
{
"testvalues": [
[1424754000000, 0.7413],
[1424840400000, 0.7375],
[1424926800000, 0.7344],
[1425013200000, 0.7375],
[1425272400000, 0.7422],
[1425358800000, 0.7427]
]
}
There is a Pandas function called read_json() that takes in json files/buffers and spits out the dataframe, but I have not been able to get it to load correctly: it shows a single column whose elements look like [1424754000000, 0.7413] rather than two columns. I have tried different 'orient' and 'typ' values to no avail. What options should I pass into the function to get a two-column dataframe corresponding to the timestamp and the value?
You can use a list comprehension with the DataFrame constructor:
import pandas as pd
df = pd.read_json('file.json')
print(df)
testvalues
0 [1424754000000, 0.7413]
1 [1424840400000, 0.7375]
2 [1424926800000, 0.7344]
3 [1425013200000, 0.7375]
4 [1425272400000, 0.7422]
5 [1425358800000, 0.7427]
print(pd.DataFrame([x for x in df['testvalues']], columns=['a','b']))
a b
0 1424754000000 0.7413
1 1424840400000 0.7375
2 1424926800000 0.7344
3 1425013200000 0.7375
4 1425272400000 0.7422
5 1425358800000 0.7427
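If you would rather not fight read_json for this shape, a sketch using the standard json module works too (the column names timestamp and value are my own choice):
import json
import pandas as pd

with open('file.json') as f:
    data = json.load(f)
# each inner [timestamp, value] pair becomes one row
df = pd.DataFrame(data['testvalues'], columns=['timestamp', 'value'])
print(df)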
I'm not sure about pandas read_json but IIUC you could do that with astype(str), str.split, str.strip:
d = {
"testvalues": [
[1424754000000, 0.7413],
[1424840400000, 0.7375],
[1424926800000, 0.7344],
[1425013200000, 0.7375],
[1425272400000, 0.7422],
[1425358800000, 0.7427]
]
}
df = pd.DataFrame(d)
res = df.testvalues.astype(str).str.strip('[]').str.split(', ', expand=True)
In [112]: df
Out[112]:
testvalues
0 [1424754000000, 0.7413]
1 [1424840400000, 0.7375]
2 [1424926800000, 0.7344]
3 [1425013200000, 0.7375]
4 [1425272400000, 0.7422]
5 [1425358800000, 0.7427]
In [113]: res
Out[113]:
0 1
0 1424754000000 0.7413
1 1424840400000 0.7375
2 1424926800000 0.7344
3 1425013200000 0.7375
4 1425272400000 0.7422
5 1425358800000 0.7427
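Note that this string-based approach leaves both columns as strings; if you need numbers downstream, convert them afterwards (a small addition on my part; 0 and 1 are the integer column labels produced by expand=True):
res = res.astype({0: 'int64', 1: 'float64'})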
You can apply a function that splits it into a pd.Series.
Say you start with
df = pd.read_json(s)
Then just apply a splitting function:
>>> df.apply(
...     lambda r: pd.Series({'l': r[0][0], 'r': r[0][1]}),
...     axis=1)
l r
0 1.424754e+12 0.7413
1 1.424840e+12 0.7375
2 1.424927e+12 0.7344
3 1.425013e+12 0.7375
4 1.425272e+12 0.7422
5 1.425359e+12 0.7427
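As a follow-up (my addition): the l column holds epoch milliseconds, which pandas can convert into proper datetimes; out is my own name for the result of the apply above:
out = df.apply(lambda r: pd.Series({'l': r[0][0], 'r': r[0][1]}), axis=1)
# epoch milliseconds -> datetime64
out['l'] = pd.to_datetime(out['l'], unit='ms')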

Pandas: how to find and concatenate values

I'm trying to replace and add some values in a pandas dataframe object. I have the following code:
import pandas as pd
# pd.DataFrame.from_items was removed in pandas 1.0; a plain dict works here
df = pd.DataFrame({'A': ["va-lue", "value-%", "value"], 'B': [4, 5, 6]})
print(df)
df['A'] = df['A'].str.replace('%', '_0')
print(df)
df['A'] = df['A'].str.replace('-', '')
print(df)
# almost there?
df.A[df['A'].str.contains('-')] + "_0"
How can I find the cell values in column A which contain a '-' sign, replace that value with '', and append a trailing '_0' to those values? The resulting data set should look like this:
A B
0 value_0 4
1 value_0 5
2 value 6
You can first keep track of the rows whose A needs the trailing string appended, then perform the operations in two steps:
mask = df['A'].str.contains('-')
df['A'] = df['A'].str.replace('-|%', '', regex=True)
df.loc[mask, 'A'] += '_0'
print(df)
Output:
A B
0 value_0 4
1 value_0 5
2 value 6
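An alternative sketch (mine, not from the original answer), starting again from the original df: compute the cleaned strings once and branch on the mask with numpy.where.
import numpy as np

# strip both '-' and '%' in one regex pass
cleaned = df['A'].str.replace(r'[-%]', '', regex=True)
# rows that originally contained '-' get the '_0' suffix
df['A'] = np.where(df['A'].str.contains('-'), cleaned + '_0', cleaned)
print(df)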
