Python - Append 2 columns of CSV together

I am loading a CSV file into a DataFrame using pandas.
An example dataframe is this:
X Y
1 4
2 5
3 6
I wish to append these two columns into a new column:
X  Y  Z
1  4  1
2  5  2
3  6  3
      4
      5
      6
How can this be done using Python?
Thank you!

Here's one way to do that:
res = pd.concat([df, df.melt()["value"]], axis=1)
print(res)
The output is:
X Y value
0 1.0 4.0 1
1 2.0 5.0 2
2 3.0 6.0 3
3 NaN NaN 4
4 NaN NaN 5
5 NaN NaN 6
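For reference, here is the whole approach as a runnable sketch (the sample frame is rebuilt from the question):

```python
import pandas as pd

# Rebuild the question's sample frame.
df = pd.DataFrame({"X": [1, 2, 3], "Y": [4, 5, 6]})

# melt() stacks the X values followed by the Y values into one "value"
# column; concatenating along axis=1 pads the shorter X/Y columns with NaN.
res = pd.concat([df, df.melt()["value"]], axis=1)
print(res)
```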

Related

Python: How to drop rows of Pandas DataFrame whose value in a certain column is NaN

I have this DataFrame and want only the records whose "Total" column is not NaN, and where columns A~E contain no more than two NaN values:
A B C D E Total
1 1 3 5 5 8
1 4 3 5 5 NaN
3 6 NaN NaN NaN 6
2 2 5 9 NaN 8
i.e. something like df.dropna(....) to get this resulting DataFrame:
A B C D E Total
1 1 3 5 5 8
2 2 5 9 NaN 8
Here's my code
import pandas as pd
dfInputData = pd.read_csv(path)
dfInputData = dfInputData.dropna(axis=1,how = 'any')
RowCnt = dfInputData.shape[0]
But it looks like no modification has been made, and no error is raised either.
Please help!! Thanks
Use boolean indexing: count the missing values across all columns except Total, and require non-missing values in Total:
df = df[df.drop('Total', axis=1).isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
Or filter columns between A:E:
df = df[df.loc[:, 'A':'E'].isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
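As a self-contained check, here is the filter applied to the question's frame (the column values are taken from the post):

```python
import numpy as np
import pandas as pd

# The question's frame, reconstructed.
df = pd.DataFrame({
    "A": [1, 1, 3, 2],
    "B": [1, 4, 6, 2],
    "C": [3, 3, np.nan, 5],
    "D": [5, 5, np.nan, 9],
    "E": [5, 5, np.nan, np.nan],
    "Total": [8, np.nan, 6, 8],
})

# Keep rows with at most two NaN outside Total and a non-missing Total.
mask = df.drop("Total", axis=1).isna().sum(axis=1).le(2) & df["Total"].notna()
out = df[mask]
print(out)
```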

Pandas Dataframe Question: Subtract next row and add specific value if NaN

Trying to group by in pandas, then sort values, and have a Results column show what you need to add to get to the next row in the group; if you are at the end of the group, replace the value with the number 3. Anyone have an idea how to do it?
import pandas as pd
df = pd.DataFrame({'label': 'a a b c b c'.split(), 'Val': [2,6,6, 4,16, 8]})
df
label Val
0 a 2
1 a 6
2 b 6
3 c 4
4 b 16
5 c 8
I'd like the results as shown below: you have to add 4 to 2 to get to 6, so the groups are effectively sorted. When there is no next value in the group, NaN would be added; replace it with the value 3. The results should look like this:
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
I tried this, and was thinking of shifting values up but the problem is that the labels aren't sorted.
df['Results'] = df.groupby('label')['Val'].transform(lambda x: x - x.shift())
df
label Val Results
0 a 2 NaN
1 a 6 4.0
2 b 6 NaN
3 c 4 NaN
4 b 16 10.0
5 c 8 4.0
Hope someone can help:D!
Use groupby, diff and abs:
df['Results'] = abs(df.groupby('label')['Val'].diff(-1)).fillna(3)
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
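Put together as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({"label": "a a b c b c".split(),
                   "Val": [2, 6, 6, 4, 16, 8]})

# diff(-1) subtracts the NEXT row within each label group, abs() turns that
# into the amount to add, and the last row of each group (NaN) becomes 3.
df["Results"] = abs(df.groupby("label")["Val"].diff(-1)).fillna(3)
print(df)
```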

Pandas: How to replace values of Nan in column based on another column?

I have a dataset as below:
import math
import numpy as np
import pandas as pd

d = {
    "A": [math.nan, math.nan, 1, math.nan, 2, math.nan, 3, 5],
    "B": np.random.randint(1, 5, size=8)
}
dt = pd.DataFrame(d)
The output I want: wherever column A has a NaN, replace it with twice the value of column B in the same row. So, given that, below is my dataset:
A B
NaN 1
NaN 1
1.0 3
NaN 2
2.0 3
NaN 1
3.0 1
5.0 3
The output I want is:
A B
2 1
2 1
1 3
4 2
2 3
2 1
3 1
5 3
My current solution, below, does not work (it assigns to a chained-indexing copy, not to dt itself):
dt[pd.isna(dt["A"])]["A"] = dt[pd.isna(dt["A"])]["B"].apply( lambda x:2*x )
print(dt)
In your case, use fillna, which accepts a Series of replacement values:
dt.A.fillna(dt.B*2, inplace=True)
dt
A B
0 2.0 1
1 2.0 1
2 1.0 3
3 4.0 2
4 2.0 3
5 2.0 1
6 3.0 1
7 5.0 3
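A self-contained version of the same fix (B is pinned to the values printed in the question rather than drawn randomly):

```python
import math
import pandas as pd

# The question's frame, with B fixed to the printed values.
dt = pd.DataFrame({
    "A": [math.nan, math.nan, 1, math.nan, 2, math.nan, 3, 5],
    "B": [1, 1, 3, 2, 3, 1, 1, 3],
})

# fillna accepts a Series: each NaN in A picks up the aligned value of 2*B.
dt["A"] = dt["A"].fillna(dt["B"] * 2)
print(dt)
```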

Is it possible to apply groupby in python pandas to an already grouped object?

I have a dataset which looks like below:
File_no A B Date Batch State
0 1 2 3 23-1-2019 2 3
1 2 7 6 23-1-2019 2 4
2 3 9 2 24-1-2019 1 2
3 5 6 3 24-1-2019 2 3
4 6 4 3 24-1-2019 1 4
5 8 2 3 25-1-2019 1 4
I want to group the data columns 'A' and 'B' based on Date and Batch, and then shift the rows of these columns based on the sequence of file numbers. For instance, in the above dataframe File_no 4 is missing.
I am able to achieve the shift, but I am not able to do it for every group individually.
For example, files 6 & 8 are not in sequence, and they are also from different dates, so the shift should not be performed there since the sequence is broken.
diff = data['File_no'].diff().ne(1).cumsum()
grouped = data.groupby(['Date','Batch'])
grouped.apply(lambda data: data.groupby(diff)[['A','B']].shift())
This performs a shift whenever there is a break in the sequence, but it doesn't take the groups into consideration.
Expected output:
File_no A B Date Batch State
0 1 Nan Nan 23-1-2019 2 3
1 2 2 3 23-1-2019 2 4
2 3 9 2 24-1-2019 1 2
3 5 Nan Nan 24-1-2019 2 3
4 6 6 3 24-1-2019 1 4
5 8 2 3 25-1-2019 1 4
I think you can pass column names together with a Series to one groupby:
diff = data['File_no'].diff().ne(1).cumsum()
data[['A','B']] = data.groupby(['Date','Batch',diff])[['A','B']].shift()
print (data)
File_no A B Date Batch State
0 1 NaN NaN 23-1-2019 2 3
1 2 2.0 3.0 23-1-2019 2 4
2 3 NaN NaN 24-1-2019 1 2
3 5 NaN NaN 24-1-2019 2 3
4 6 NaN NaN 24-1-2019 1 4
5 8 NaN NaN 25-1-2019 1 4
EDIT:
r = np.arange(data['File_no'].min(), data['File_no'].max() + 1)
data = data.set_index('File_no').reindex(r)
diff = data.index.to_series().diff().ne(1).cumsum()
data[['A','B']] = data.groupby(['Date','Batch',diff])[['A','B']].shift()
data = data.dropna(how='all').reset_index()
print (data)
File_no A B Date Batch State
0 1 NaN NaN 23-1-2019 2.0 3.0
1 2 2.0 3.0 23-1-2019 2.0 4.0
2 3 NaN NaN 24-1-2019 1.0 2.0
3 5 NaN NaN 24-1-2019 2.0 3.0
4 6 9.0 2.0 24-1-2019 1.0 4.0
5 8 NaN NaN 25-1-2019 1.0 4.0
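The EDIT, end to end, on a frame rebuilt from the question (note the double-bracket column selection, which newer pandas requires, and the rename_axis call to keep the index name through reindex):

```python
import numpy as np
import pandas as pd

# The question's frame, reconstructed.
data = pd.DataFrame({
    "File_no": [1, 2, 3, 5, 6, 8],
    "A": [2, 7, 9, 6, 4, 2],
    "B": [3, 6, 2, 3, 3, 3],
    "Date": ["23-1-2019", "23-1-2019", "24-1-2019",
             "24-1-2019", "24-1-2019", "25-1-2019"],
    "Batch": [2, 2, 1, 2, 1, 1],
    "State": [3, 4, 2, 3, 4, 4],
})

# Reindex over the full File_no range so gaps become all-NaN rows, detect
# breaks in the sequence, shift A/B within each (Date, Batch, run) group,
# then drop the filler rows again.
r = np.arange(data["File_no"].min(), data["File_no"].max() + 1)
data = data.set_index("File_no").reindex(r).rename_axis("File_no")
diff = data.index.to_series().diff().ne(1).cumsum()
data[["A", "B"]] = data.groupby(["Date", "Batch", diff])[["A", "B"]].shift()
data = data.dropna(how="all").reset_index()
print(data)
```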

Reading a text file in pandas with separator as linefeed (\n) and line terminator as two linefeeds (\n\n)

I have a text file of the form:
data.txt
2
8
4

3
1
9

6
5
7
How can I read it into a pandas DataFrame like this?
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7
Try this:
from io import StringIO  # pd.compat.StringIO was removed from newer pandas

with open(filename, 'r') as f:
    data = f.read().replace('\n',',').replace(',,','\n')

In [7]: pd.read_csv(StringIO(data), header=None)
Out[7]:
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7
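Self-contained, with io.StringIO standing in for the file on disk (the data.txt contents are assumed from the question):

```python
from io import StringIO

import pandas as pd

raw = "2\n8\n4\n\n3\n1\n9\n\n6\n5\n7"  # assumed contents of data.txt

# Single newlines become commas; the double commas left by the blank
# lines then become row separators.
data = raw.replace("\n", ",").replace(",,", "\n")
df = pd.read_csv(StringIO(data), header=None)
print(df)
```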
Option 1
Much easier, if you know there are always N elements in a group - just load your data and reshape -
pd.DataFrame(np.loadtxt('data.txt').reshape(3, -1))
0 1 2
0 2.0 8.0 4.0
1 3.0 1.0 9.0
2 6.0 5.0 7.0
To load integers, pass dtype to loadtxt -
pd.DataFrame(np.loadtxt('data.txt', dtype=int).reshape(3, -1))
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7
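The same idea, runnable without a file on disk (StringIO stands in for data.txt):

```python
from io import StringIO

import numpy as np
import pandas as pd

raw = "2\n8\n4\n\n3\n1\n9\n\n6\n5\n7"  # assumed contents of data.txt

# loadtxt skips blank lines, so the nine values arrive as one flat vector;
# reshape(3, -1) restores the three-per-group structure.
df = pd.DataFrame(np.loadtxt(StringIO(raw), dtype=int).reshape(3, -1))
print(df)
```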
Option 2
This is more general, will work when you cannot guarantee that there are always 3 numbers at a time. The idea here is to read in blank lines as NaN, and separate your data based on the presence of NaNs.
df = pd.read_csv('data.txt', header=None, skip_blank_lines=False)
df
0
0 2.0
1 8.0
2 4.0
3 NaN
4 3.0
5 1.0
6 9.0
7 NaN
8 6.0
9 5.0
10 7.0
df_list = []
for _, g in df.groupby(df.isnull().cumsum().values.ravel()):
    df_list.append(g.dropna().reset_index(drop=True))

# transpose so each block becomes a row rather than a column
df = pd.concat(df_list, axis=1, ignore_index=True).T
df
0 1 2
0 2.0 8.0 4.0
1 3.0 1.0 9.0
2 6.0 5.0 7.0
Caveat - if your data also has NaNs, this will not separate properly.
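Putting Option 2 together as a runnable sketch (StringIO stands in for the file; note the final transpose, which lays each block out as a row):

```python
from io import StringIO

import pandas as pd

raw = "2\n8\n4\n\n3\n1\n9\n\n6\n5\n7"  # assumed contents of data.txt

# skip_blank_lines=False turns each blank line into a NaN marker row.
df = pd.read_csv(StringIO(raw), header=None, skip_blank_lines=False)

# The NaN markers split the single column into blocks; cumsum labels each
# block, and the transpose lays the blocks out as rows.
blocks = [g.dropna().reset_index(drop=True)
          for _, g in df.groupby(df.isnull().cumsum().values.ravel())]
out = pd.concat(blocks, axis=1, ignore_index=True).T
print(out)
```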
Although this is definitely not the best way to handle it, we can do some processing ourselves. In case the values are integers, the following should work:
import pandas as pd

with open('data.txt') as f:
    data = [list(map(int, row.split())) for row in f.read().split('\n\n')]

dataframe = pd.DataFrame(data)
which produces:
>>> dataframe
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7
