I have a one column Pandas dataframe:
'asdf'
0
1
1
1
0
...
1
How do I turn it into:
1 0
0 1
0 1
0 1
1 0
such that the first column is the "0" column and the second column is the "1" column and both are dummy variables?
You can use get_dummies, for example:
import numpy as np
import pandas as pd
one = pd.DataFrame({'asdf':np.random.randint(0,2,10)})
two = pd.get_dummies(one.loc[:,'asdf'])
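As a minimal sketch on the exact values from the question (the names df and dummies are illustrative, not from the original): recent pandas versions return boolean dummy columns by default, so passing dtype=int is one way to get 0/1 integers. The columns come out ordered by category value, so the "0" column is first.

```python
import pandas as pd

# Values copied from the question; variable names are illustrative.
df = pd.DataFrame({'asdf': [0, 1, 1, 1, 0]})

# dtype=int requests 0/1 integers instead of booleans.
dummies = pd.get_dummies(df['asdf'], dtype=int)
print(dummies)
```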
Related
I have several CSV files that share the same elements in their first column.
For example:
csv-1.csv:
Value,0
Currency,0
datetime,0
Receiver,0
Beneficiary,0
Flag,0
idx,0
csv-2.csv:
Value,0
Currency,1
datetime,0
Receiver,0
Beneficiary,0
Flag,0
idx,0
I want to merge these files (there are more than two) into something like this:
left       csv-1  csv-2
Value      0      0
Currency   0      1
datetime   0      0
How can I write this function in Python?
You can use:
import pandas as pd
import pathlib
out = (pd.concat([pd.read_csv(csvfile, header=None, index_col=[0], names=[csvfile.stem])
for csvfile in sorted(pathlib.Path.cwd().glob('*.csv'))], axis=1)
.rename_axis('left').reset_index())
Output:
>>> out
left csv-1 csv-2
0 Value 0 0
1 Currency 0 1
2 datetime 0 0
3 Receiver 0 0
4 Beneficiary 0 0
5 Flag 0 0
6 idx 0 0
First, index each dataframe by its first column, which is the key you will join on. The files have no header row, so pass header=None and name the columns yourself:
import pandas as pd
df1 = pd.read_csv('csv-1.csv', header=None, names=['col1', 'csv-1'])
df2 = pd.read_csv('csv-2.csv', header=None, names=['col1', 'csv-2'])
df1 = df1.set_index('col1')
df2 = df2.set_index('col1')
df = df1.join(df2, how='outer')
Then rename the columns if needed, or make a new index.
Here's what you can do
import pandas as pd
from glob import glob
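def refineFilename(path):
    # Drop the extension, e.g. "csv-1.csv" -> "csv-1"
    return path.split(".")[0]
df=pd.DataFrame()
# sorted() keeps the column order stable across runs
for file in sorted(glob("csv-*.csv")):
    new=pd.read_csv(file,header=None,index_col=[0])
    df[refineFilename(file)]=new[1]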
df.reset_index(inplace=True)
df.rename(columns={0:"left"},inplace=True)
print(df)
"""
left csv-1 csv-2
0 Value 0 0
1 Currency 0 1
2 datetime 0 0
3 Receiver 0 0
4 Beneficiary 0 0
5 Flag 0 0
6 idx 0 0
"""
Here df accumulates all the data: we iterate over the files and add the second column of each file to df, using the file name (without its extension) as the column name.
I have two data frames, dfa and dfb:
import pandas as pd
import numpy as np
dfa = pd.DataFrame(np.array([[1,0,0,1,0], [1,1,1,0,0]]))
dfa
dfb = pd.DataFrame(np.array([[1,1,0,1,0], [0,0,1,0,0],[0,0,1,1,1]]))
dfb
Now I need to overwrite the first row of dfa with the union of that row and the first and second rows of dfb:
[1,0,0,1,0] union [1,1,0,1,0] union [0,0,1,0,0] = [1,1,1,1,0]
The final dfa should look like this:
dfa
0 1 2 3 4
0 1 1 1 1 0
1 1 1 1 0 0
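One possible approach, as a sketch only: it assumes "union" means an element-wise OR of the 0/1 values, and the intermediate name stacked is illustrative rather than part of the question.

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame(np.array([[1, 0, 0, 1, 0], [1, 1, 1, 0, 0]]))
dfb = pd.DataFrame(np.array([[1, 1, 0, 1, 0], [0, 0, 1, 0, 0], [0, 0, 1, 1, 1]]))

# Stack dfa's first row on top of dfb's first two rows (shape 3 x 5).
stacked = np.vstack([dfa.iloc[0].to_numpy(), dfb.iloc[:2].to_numpy()])

# OR the stacked rows column by column and write the result back into dfa.
dfa.iloc[0] = np.bitwise_or.reduce(stacked, axis=0)
print(dfa)
```

The second row of dfa is left untouched; only row 0 is overwritten.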
Suppose that I have this data:
import pandas as pd
data = pd.DataFrame({'Id':[1,1,1,6,7],'Sales':[2,3,4,2,8]})
Is there a filter that outputs a dataframe containing only the rows whose Id is shared with another row? See expected output below:
Let us try
data=data[data.Id.duplicated(keep=False)]
Id Sales
0 1 2
1 1 3
2 1 4
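For comparison, here is a sketch of an equivalent filter that counts rows per Id instead of flagging duplicates. The names counts and repeated are illustrative; on the question's data this should select the same rows as the duplicated(keep=False) filter.

```python
import pandas as pd

data = pd.DataFrame({'Id': [1, 1, 1, 6, 7], 'Sales': [2, 3, 4, 2, 8]})

# For every row, the number of rows that share its Id.
counts = data.groupby('Id')['Id'].transform('size')

# Keep only rows whose Id occurs more than once.
repeated = data[counts > 1]
print(repeated)
```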
I just need one column of my dataframe, but in the original order. When I extract it, it comes out sorted by value, and I can't understand why. I tried different ways to select the column, but every time it was sorted by value.
This is my code:
import pandas
data = pandas.read_csv('/data.csv', sep=';')
longti = data.iloc[:,4]
To return the first column, your code should work:
import pandas as pd
df = pd.DataFrame(dict(A=[1,2,3,4,5,6], B=['A','B','C','D','E','F']))
# Use a new name so df stays a DataFrame for later selections
first = df.iloc[:,0]
Out:
0 1
1 2
2 3
3 4
4 5
5 6
If you want to return the second column, you can use the following:
second = df.iloc[:,1]
Out:
0 A
1 B
2 C
3 D
4 E
5 F
When working with a DataFrame, is there a way to change the value of a cell based on a value in a column?
For example, I have a DataFrame of exam results that looks like this:
answer_is_a answer_is_c
0 a a
1 b b
2 c c
I want to code them as correct (1) and incorrect (0). So it would look like this:
answer_is_a answer_is_c
0 1 0
1 0 0
2 0 1
So I need to iterate over the entire DataFrame, compare what is already in the cell with the last character of the column header, and then change the cell value.
Any thoughts?
By default, DataFrame.apply iterates through the columns, passing each as a series to the function you feed it. Series have a name attribute that is a string we'll use to extract the answer.
So you could do this:
from io import StringIO
import pandas
data = StringIO("""\
answer_is_a answer_is_c
a a
b b
c c
""")
x = (
pandas.read_table(data, sep=r'\s+')
.apply(lambda col: col == col.name.split('_')[-1])
.astype(int)
)
And x prints out as:
answer_is_a answer_is_c
0 1 0
1 0 0
2 0 1
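To make the mechanism concrete, here is a small sketch that builds the same frame directly (the names df, seen and coded are illustrative). It first shows that apply hands each column to the function as a Series whose name is the column label, then repeats the comparison against the last part of that label.

```python
import pandas as pd

df = pd.DataFrame({
    'answer_is_a': ['a', 'b', 'c'],
    'answer_is_c': ['a', 'b', 'c'],
})

# Each column arrives as a Series; returning col.name shows what the function sees.
seen = df.apply(lambda col: col.name)

# Compare every cell with the text after the last underscore of its column label.
coded = df.apply(lambda col: (col == col.name.rsplit('_', 1)[-1]).astype(int))
print(coded)
```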