Iterating through pandas string index turned them into floats - python

I have a csv file:
SID done good_ecg good_gsr good_resp comment
436 0 1 1
2411 1 1 1
3858 0 1 1
4517 0 1 1 117 min diff between files
9458 1 0 1 ######### error in my script
9754 0 1 1 trigger fehler
#REF!
88.8888888889
which I load into a pandas DataFrame like this:
df = pandas.read_csv(f, delimiter="\t", dtype="str", index_col='SID')
I want to iterate through the index and print each one. But when I try
for subj in df.index:
    print(subj)
I get
436.0
2411.0
...
Now there is this '.0' at the end of each number. What am I doing wrong?
I have also tried iterating with iterrows() and have the same problem.
Thank you for any help!
EDIT: Here is the whole code I am using:
import pandas

def write():
    df = pandas.read_csv("overview.csv", delimiter="\t", dtype="str", index_col='SID')
    for subj in df.index:
        print(subj)

write()

Ah. The dtype parameter doesn't apply to the index_col:
>>> !cat sindex.csv
a,b,c
123,50,R
234,51,R
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col="a")
>>> df
      b  c
a
123  50  R
234  51  R
>>> df.index
Int64Index([123, 234], dtype='int64', name='a')
Instead, read it in without an index_col (None is actually the default, so you don't need index_col=None at all, but here I'll be explicit) and then set the index:
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col=None)
>>> df = df.set_index("a")
>>> df
      b  c
a
123  50  R
234  51  R
>>> df.index
Index(['123', '234'], dtype='object', name='a')
(I can't think of circumstances under which df.index would have dtype object but when you iterate over it you'd get integers, but you didn't actually show any self-contained code that generated that problem.)
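As an alternative (a sketch of my own, not part of the original answer): read_csv's converters run while each column is parsed, before the index is set, so passing converters={'a': str} should keep the index as strings in a single call:

import pandas as pd

# converters are applied during parsing, before 'a' becomes the index,
# so the index stays dtype object (strings) with no trailing '.0'
df = pd.read_csv("sindex.csv", converters={"a": str}, index_col="a")
print(df.index)  # Index(['123', '234'], dtype='object', name='a')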

Related

check if column is blank in pandas dataframe

I have the following csv file:
A|B|C
1100|8718|2021-11-21
1104|21|
I want to create a dataframe that gives me the date output as follows:
      A     B               C
0  1100  8718  20211121000000
1  1104    21              ""
This means:
if C is empty:
    put double quotes
else:
    format date to yyyymmddhhmmss (padding hhmmss with zeros)
My code:
df['C'] = np.where(df['C'].empty, df['C'].str.replace('', '""'), df['C'] + '000000')
but it gives me the following:
      A     B           C
0  1100  8718  2021-11-21
1  1104    21           0
I have tried another piece of code:
if df['C'].empty:
    df['C'] = df['C'].str.replace('', '""')
else:
    df['C'] = df['C'].str.replace('-', '') + '000000'
OUTPUT:
      A     B               C
0  1100  8718  20211121000000
1  1104    21         0000000
Use dt.strftime:
df = pd.read_csv('data.csv', sep='|', parse_dates=['C'])
df['C'] = df['C'].dt.strftime('%Y%m%d%H%M%S').fillna('""')
print(df)
# Output:
      A     B               C
0  1100  8718  20211121000000
1  1104    21              ""
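For a self-contained check, the same pipeline can be run against an in-memory buffer instead of data.csv (using io.StringIO to stand in for the file is my assumption, not from the original answer):

import io
import pandas as pd

csv_text = 'A|B|C\n1100|8718|2021-11-21\n1104|21|\n'
df = pd.read_csv(io.StringIO(csv_text), sep='|', parse_dates=['C'])

# NaT rows come out of strftime as NaN, which fillna turns into '""'
df['C'] = df['C'].dt.strftime('%Y%m%d%H%M%S').fillna('""')
print(df)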
A good way would be to convert the column into datetime using pd.to_datetime with the parameter errors='coerce', then drop the resulting NaT values.
import pandas as pd
x = pd.DataFrame({
    'one': 20211121000000,
    'two': 'not true',
    'three': '20211230'
}, index=[1])
x.apply(lambda x: pd.to_datetime(x, errors='coerce')).T.dropna()
# Output:
                            1
one   1970-01-01 05:36:51.121
three 2021-12-30 00:00:00.000
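Applied to the question's column C, that approach could look like this (a sketch; the frame is rebuilt from the values shown in the question):

import pandas as pd

df = pd.DataFrame({'A': [1100, 1104], 'B': [8718, 21], 'C': ['2021-11-21', None]})

# errors='coerce' turns unparseable/missing entries into NaT instead of raising;
# strftime then yields NaN for those, which fillna replaces with '""'
c = pd.to_datetime(df['C'], errors='coerce')
df['C'] = c.dt.strftime('%Y%m%d%H%M%S').fillna('""')
print(df)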

How to write two different variables' values in the same column of an Excel/pandas DataFrame

I'm new to pandas and trying to figure out how to put two different variables' values in the same column.
import pandas as pd
import requests
from bs4 import BeautifulSoup

itemproducts = pd.DataFrame()
url = 'https://www.trwaftermarket.com/en/catalogue/product/BCH720/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
code_name = soup.find_all('div', {'class': 'col-sm-6 intro-section reset-margin'})
for head in code_name:
    item_code = head.find('span', {'class': 'heading'}).text
    item_name = head.find('span', {'class': 'subheading'}).text
for tab_ in tab_4:  # NOTE: tab_4 is never defined in the snippet as posted
    ab = tab_.find_all('td')
    make_name1 = ab[0].text.replace('Make', '')
    code1 = ab[1].text.replace('OE Number', '')
    make_name2 = ab[2].text.replace('Make', '')
    code2 = ab[3].text.replace('OE Number', '')
    itemproducts = itemproducts.append({'CODE': item_code,
                                        'NAME': item_name,
                                        'MAKE': [make_name1, make_name2],
                                        'OE NUMBER': [code1, code2]}, ignore_index=True)
OUTPUT (Excel image)
What I actually want (image)
In pandas, all columns must have the same length. So, in this case, I suggest that you make each column a fixed-length list; for those that are one member short, append a NaN to match.
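For instance, itertools.zip_longest can do the padding (a minimal sketch with made-up values, not the asker's scraped data):

from itertools import zip_longest
import pandas as pd

makes = ['HONDA', 'HONDA']
codes = ['43019-SAA-J51']  # one entry short

# zip_longest pads the shorter list with None (a missing value in pandas)
df = pd.DataFrame(list(zip_longest(makes, codes)), columns=['MAKE', 'OE NUMBER'])
print(df)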
I found a similar question here on Stack Overflow that can help you. Another approach is to use the explode function of the pandas DataFrame.
Below is an example from the pandas documentation.
>>> df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
>>> df
           A  B
0  [1, 2, 3]  1
1        foo  1
2         []  1
3     [3, 4]  1
>>> df.explode('A')
     A  B
0    1  1
0    2  1
0    3  1
1  foo  1
2  NaN  1
3    3  1
3    4  1
I couldn't reproduce the results from your script. However, based on your end dataframe, perhaps you can make use of explode together with apply on the dataframe at the end:
# creating your dataframe
itemproducts = pd.DataFrame({'CODE': 'BCH720',
                             'MAKE': [['HONDA', 'HONDA']],
                             'NAME': ['Brake Caliper'],
                             'OE NUMBER': [['43019-SAA-J51', '43019-SAA-J50']]})
>>> itemproducts
     CODE                MAKE           NAME                            OE NUMBER
0  BCH720  ['HONDA', 'HONDA']  Brake Caliper  ['43019-SAA-J51', '43019-SAA-J50']
#using apply method with explode on 'MAKE' and 'OE NUMBER'
>>> itemproducts.apply(lambda x: x.explode() if x.name in ['MAKE', 'OE NUMBER'] else x)
     CODE   MAKE           NAME      OE NUMBER
0  BCH720  HONDA  Brake Caliper  43019-SAA-J51
0  BCH720  HONDA  Brake Caliper  43019-SAA-J50
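Since the goal was an Excel sheet, the exploded frame can be written out directly; to_excel needs an engine such as openpyxl installed, and the file name here is just an example:

result = itemproducts.apply(lambda x: x.explode() if x.name in ['MAKE', 'OE NUMBER'] else x)
result.to_excel('itemproducts.xlsx', index=False)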

How to remove double quotes while assigning columns to dataframe

I have the below list:
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
When I try to read the above columns and assign them inside a dataframe, I get extra double quotes:
df = pd.DataFrame(data, columns=[ColumnName])
columns=[ColumnName]
I am getting columns = ["'Emp_id','Emp_Name','EmpAGe'"]
How can I handle these extra double quotes and remove them while assigning the header to the data?
This code:
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
is a tuple, not a list.
In case you want three columns, each named from the tuple above, you need:
df = pd.DataFrame(data, columns=list(ColumnName))
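A quick comparison of the two calls, with a tiny made-up data list:

import pandas as pd

ColumnName = 'Emp_id', 'Emp_Name', 'EmpAGe'   # a tuple
data = [[1, 'Ann', 30], [2, 'Bob', 41]]

ok = pd.DataFrame(data, columns=list(ColumnName))
print(ok.columns.tolist())  # ['Emp_id', 'Emp_Name', 'EmpAGe']

# Wrapping the tuple in a list asks for ONE column named by the whole tuple,
# so pd.DataFrame(data, columns=[ColumnName]) fails here: the rows have
# three values each but only one column was requested.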
The problem is how you define the columns for the pandas DataFrame.
The example below builds a correct data frame:
import pandas as pd
ColumnName1 = 'Emp_id', 'Emp_Name', 'EmpAGe'
df1 = [['A1', 'A1', 'A2'], ['1', '2', '1'], ['a0', 'a1', 'a3']]
df = pd.DataFrame(data=df1, columns=ColumnName1)
df
Result:
  Emp_id Emp_Name EmpAGe
0     A1       A1     A2
1      1        2      1
2     a0       a1     a3
A screenshot of the code I wrote with the result shows no double quotes.
Just for the sake of understanding, you can use col.replace to get the desired result.
Let's take an example:
>>> df
  col1" col2"
0     1     1
1     2     2
Result:
>>> df.columns = [col.replace('"', '') for col in df.columns]
# df.columns = df.columns.str.replace('"', '') <-- can use this as well
>>> df
  col1 col2
0    1    1
1    2    2
OR
>>> df = pd.DataFrame({ '"col1"':[1, 2], '"col2"':[1,2]})
>>> df
  "col1" "col2"
0      1      1
1      2      2
>>> df.columns = [col.replace('"', '') for col in df.columns]
>>> df
  col1 col2
0    1    1
1    2    2
Your input is not quite right. ColumnName is already list-like and it should be passed on directly rather than wrapped in another list. In the latter case it would be interpreted as one single column.
df = pd.DataFrame(data, columns=ColumnName)

Python - Regex split data in Dataframe

I have a column containing values. I want to split it based on a regex: if the regex matches, the original value will be replaced with the left side of the split, and a new column will contain the right side.
Below is some sample code. I feel I am close but it isn't quite working.
import pandas as pd
import re
df = pd.DataFrame({ 'A' : ["test123","foo"]})
# Example regex: split the value if it ends in numbers
r = r"^(.+?)(\d*)$"
df['A'], df['B'] = zip(*df['A'].apply(lambda x: x.split(r, 1)))
print(df)
In the example above I would expect the following output
      A    B
0  test  123
1   foo
I am fairly new to Python and assumed this would be the way to go. However, it appears that I haven't quite hit the mark. Is anyone able to help me correct this example?
Just based on your own regex (note: np below is numpy, i.e. import numpy as np):
df.A.str.split(r,expand=True).replace('',np.nan).dropna(thresh=1,axis=1).fillna('')
Out[158]:
      1    2
0  test  123
1   foo
df[['A','B']]=df.A.str.split(r,expand=True).replace('',np.nan).dropna(thresh=1,axis=1).fillna('')
df
Out[160]:
      A    B
0  test  123
1   foo
Your regex works just fine; use it with str.extract:
df = pd.DataFrame({ 'A' : ["test123","foo", "12test3"]})
df[['A', 'B']] = df['A'].str.extract(r"^(.+?)(\d*)$", expand=True)
        A    B
0    test  123
1     foo
2  12test    3
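If you would rather name the columns in the pattern itself, str.extract also accepts named groups (a small variation on the same regex, not from the original answer):

import pandas as pd

df = pd.DataFrame({'A': ['test123', 'foo', '12test3']})

# (?P<name>...) groups become the output column names directly
out = df['A'].str.extract(r'^(?P<A>.+?)(?P<B>\d*)$')
print(out)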
import re
import pandas as pd

def bar(x):
    els = re.findall(r'^(.+?)(\d*)$', x)[0]
    if len(els):
        return els
    else:
        return x, None

def foo():
    df = pd.DataFrame({'A': ["test123", "foo"]})
    df['A'], df['B'] = zip(*df['A'].apply(bar))
    print(df)
result:
      A    B
0  test  123
1   foo

Read all lines of csv file using .read_csv

I am trying to read a simple csv file using pandas, but I can't figure out how to avoid "losing" the first row.
For example:
my_file.csv
Looks like this:
45
34
77
But when I try to read it:
In [18]: import pandas as pd
In [19]: df = pd.read_csv('my_file.csv', header=False)
In [20]: df
Out[20]:
   45
0  34
1  77

[2 rows x 1 columns]
This is not what I am after, I want to have 3 rows. I want my DataFrame to look exactly like this:
In [26]: my_list = [45,34,77]
In [27]: df = pd.DataFrame(my_list)
In [28]: df
Out[28]:
    0
0  45
1  34
2  77

[3 rows x 1 columns]
How can I use .read_csv to get the result I am looking for?
Yeah, this is a bit of a UI problem. We should handle False; right now it thinks you want the header on row 0 (since 0 == False). Use None instead:
>>> df = pd.read_csv("my_file.csv", header=False)
>>> df
   45
0  34
1  77
>>> df = pd.read_csv("my_file.csv", header=None)
>>> df
    0
0  45
1  34
2  77
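As an aside (my addition, not part of the original answer), header=None combines with names= to label the column in the same call; the name 'value' is just an example:

>>> pd.read_csv("my_file.csv", header=None, names=["value"])
   value
0     45
1     34
2     77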
