merge all columns in data frame to a list - python

I have a CSV file with no headers and want to "append" the values of all its columns into a single list.
For example, I have this df:
1 2 7
3 4 8
5 6 9
Here is a sketch of what I want to do:
import pandas as pd
df = pd.read_csv('file.csv', header=None)
data = []
for i in range(df.shape[1]):  # number of columns; in this case 3
    data.extend(df[i])  # add the values of column i to the list
# desired result: data = [1,2,3,4,5,6,7,8,9]

IIUC, you can flatten the values and then sort them:
import numpy as np
np.sort(df.values.flatten())
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
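The question asks for a Python list rather than a NumPy array. A minimal sketch, assuming the frame from the question is built in memory as a stand-in for reading file.csv; the names by_column and sorted_values are only illustrative:

```python
import pandas as pd

# Stand-in for pd.read_csv('file.csv', header=None); values copied from the question.
df = pd.DataFrame([[1, 2, 7], [3, 4, 8], [5, 6, 9]])

# order='F' walks down each column before moving to the next one,
# which matches "append column 0, then column 1, then column 2".
by_column = df.to_numpy().ravel(order='F').tolist()
print(by_column)      # [1, 3, 5, 2, 4, 6, 7, 8, 9]

# If the values should come out sorted instead, sort the flattened list.
sorted_values = sorted(df.to_numpy().ravel().tolist())
print(sorted_values)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Which of the two is wanted depends on whether the column order or the sorted order matters; the sample data does not distinguish them once sorted.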

Related

Insert Row in Dataframe at certain place

I have the following DataFrame:
Now I want to insert an empty row after every row where the column "Zweck" equals 7.
So, for example, the third row should be an empty row.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ren_dict = {i: df.columns[i] for i in range(len(df.columns))}
ind = df[df['f'] == 7].index
df = pd.DataFrame(np.insert(df.values, ind, values=[33], axis=0))
df.rename(columns=ren_dict, inplace=True)
ind_empt = df['a'] == 33
df[ind_empt] = ''
print(df)
Output
a b f
0 1 1 1
1
2 2 2 7
3 3 3 3
4 4 4 4
5
6 5 5 7
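Here the DataFrame is rebuilt in one step with np.insert, because appending rows one at a time would be resource intensive. The inserted rows are first filled with the placeholder value 33, which is necessary because np.insert cannot put strings into a numeric array. The columns are then renamed back to their original names with df.rename. Finally, the rows where df['a'] == 33 are set to empty strings. Note that np.insert places each new row before the given index, which is why the empty rows in the output precede the rows where 'f' equals 7; passing ind + 1 instead places them after. The placeholder must also be a value that never occurs in column 'a'.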
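For comparison, here is a sketch of a different technique (not the np.insert approach above) that avoids a numeric placeholder and puts the empty rows after the matching rows. It assumes the frame has the default 0..n-1 index; the names hits, blanks and out are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})

# Positions of the rows where 'f' equals 7 (here 1 and 4).
hits = np.flatnonzero(df['f'].to_numpy() == 7)

# Blank rows get fractional labels (1.5 and 4.5), so sorting by index
# places each of them directly after its matching row.
blanks = pd.DataFrame('', index=hits + 0.5, columns=df.columns)

# Cast to object so numbers and empty strings can share a column.
out = pd.concat([df.astype(object), blanks]).sort_index().reset_index(drop=True)
print(out)
```

The fractional-label trick only works when positions and index labels coincide, which is why the default index is assumed.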

How to union some records in a pandas data frame that has intersection

I have a data frame with two columns. Each row contains the start and end of a range, and the data frame is sorted.
I want to merge all ranges that intersect, repeating until no two ranges intersect.
My solution uses a for loop that iterates over all rows and merges them, but it is very slow. Can anyone suggest a faster way?
Example
Input:
A B
1 5
2 4
7 9
11 20
12 21
Output:
A B
1 5
7 9
11 21
To create the data frame, use the code below:
import pandas as pd
a = [1, 2, 7, 11, 12]
b = [5, 4, 9, 20, 21]
df = pd.DataFrame({"A": a, "B": b})
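I found a solution that is faster than a for loop on large data frames. Suppose the data frame is named df.
# A row starts a new group when its start is greater than the previous row's end.
df["merger"] = df.A.gt(df.B.shift(1)).cumsum()
# Each group collapses to one range: smallest start, largest end.
df = df.groupby(by=["merger"]).agg({"A": "min", "B": "max"}).reset_index(drop=True)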
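One caveat: comparing each start only with the previous row's end can split a range that is nested inside an earlier, longer one, for example (1, 10), (2, 3), (4, 5). A hedged variant compares against the running maximum of the ends instead. The function name merge_ranges is hypothetical, and the sketch assumes that ranges which merely touch should be merged as well:

```python
import pandas as pd

def merge_ranges(df):
    """Merge intersecting [A, B] ranges; a sketch, not a tuned implementation."""
    df = df.sort_values("A").reset_index(drop=True)
    # Largest end seen in all earlier rows (NaN for the first row).
    running_end = df["B"].cummax().shift()
    # A new group starts only when a range begins after everything seen so far.
    group = df["A"].gt(running_end).cumsum().rename("group")
    merged = df.groupby(group).agg(A=("A", "min"), B=("B", "max"))
    return merged.reset_index(drop=True)

example = pd.DataFrame({"A": [1, 2, 7, 11, 12], "B": [5, 4, 9, 20, 21]})
print(merge_ranges(example))

nested = pd.DataFrame({"A": [1, 2, 4], "B": [10, 3, 5]})
print(merge_ranges(nested))
```

On the example from the question this gives the same three ranges as before; on the nested input it returns the single range (1, 10).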

Converting columns names from a list

I am reading multiple CSV files into a list of pandas data frames before concatenating them. Every file after the first has different column names, but I want them all to use the column names of the first file so that I can stack the rows under the same columns.
I can read them like this:
dfs = (pd.read_csv(f) for f in x)
However, when I concatenate them, the result keeps both sets of columns side by side. Here is example data that reproduces the outcome:
import numpy as np
import pandas as pd
fs = pd.DataFrame(np.random.randn(5, 3),
                  index=[1, 2, 3, 4, 5],
                  columns=['bgif', 'datasetkey', 'occurrenceid'])
ds = pd.DataFrame(np.random.randn(5, 3),
                  index=[1, 2, 3, 4, 5],
                  columns=['v1', 'v2', 'v3'])
df_row_merged = pd.concat([fs, ds], ignore_index=True)
So I was wondering how I could change the headers of the other files to match the first, as I presume this would let the rows line up?
Use np.concatenate to keep only the values and discard the column names.
IIUC, something like this should work:
dfs = [fs, ds]
df_row_merged = pd.DataFrame(np.concatenate(dfs), columns=dfs[0].columns)
>>> df_row_merged
bgif datasetkey occurrenceid
0 -0.414690 0.842747 -1.653554
1 0.556024 0.577895 0.852845
2 -0.151411 0.558659 -1.219965
3 -0.702385 -0.895022 -1.123310
4 0.356573 2.121478 0.321810
5 3.349352 -0.746372 -0.849632
6 1.142182 0.175079 0.179597
7 -0.755518 0.365921 -0.212967
8 -1.559804 -0.024858 -0.233414
9 -0.602356 1.521461 0.747047
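np.concatenate converts everything to one NumPy array, so mixed column types would be coerced to a common dtype. A sketch of an alternative that stays in pandas, assuming every frame has the same number of columns in the same order; the names first_cols and aligned are only illustrative:

```python
import numpy as np
import pandas as pd

fs = pd.DataFrame(np.random.randn(5, 3),
                  index=[1, 2, 3, 4, 5],
                  columns=['bgif', 'datasetkey', 'occurrenceid'])
ds = pd.DataFrame(np.random.randn(5, 3),
                  index=[1, 2, 3, 4, 5],
                  columns=['v1', 'v2', 'v3'])
dfs = [fs, ds]

# Give every frame the header of the first one, then stack the rows.
first_cols = dfs[0].columns
aligned = [d.set_axis(first_cols, axis=1) for d in dfs]
df_row_merged = pd.concat(aligned, ignore_index=True)
print(df_row_merged.shape)  # (10, 3)
```

set_axis returns a new frame, so the original fs and ds keep their own column names.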

How to delete the randomly sampled rows of a dataframe, to avoid sampling them again?

I have a DataFrame (df) of 12 rows x 5 columns. I sample 1 row from each label and create a new DataFrame (df1) of 3 rows x 5 columns. The next time I sample rows from df, I must not pick the ones that are already in df1. So how can I delete the already sampled rows from df?
import pandas as pd
import numpy as np
# 12x5
df = pd.DataFrame(np.random.rand(12, 5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
#3x5
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))
#My attempt. It should be a 9x5 dataframe
df2 = pd.concat(f.drop(idx) for idx, f in df1.groupby('label'))
df
df1
df2
Starting with this DataFrame:
df = pd.DataFrame(np.random.rand(12, 5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
Your first sample is this:
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))
For the second sample, you can drop df1's indices from df:
pd.concat(g.sample(1) for idx, g in df.drop(df1.index).groupby('label'))
Out:
0 1 2 3 4 label
2 0.188005 0.765640 0.549734 0.712261 0.334071 1
4 0.599812 0.713593 0.366226 0.374616 0.952237 2
8 0.631922 0.585104 0.184801 0.147213 0.804537 3
This is not an inplace operation. It doesn't modify the original DataFrame. It just drops the rows, returns a copy, and samples from that copy. If you want it to be permanent, you can do:
df2 = df.drop(df1.index)
And sample from df2 afterwards.
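If several rounds of sampling are needed, the drop step can be repeated in a loop. A minimal sketch, assuming the index labels are unique and using groupby(...).sample, which is available in recent pandas versions; the names remaining, samples and picked, and the choice of two rounds, are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(12, 5))
df['label'] = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

remaining = df   # rows still available for sampling
samples = []     # one 3-row DataFrame per round

for _ in range(2):  # two rounds, chosen arbitrarily for illustration
    # One random row per label from the rows not yet used.
    picked = remaining.groupby('label').sample(n=1)
    samples.append(picked)
    # drop returns a copy without the picked rows; df itself is untouched.
    remaining = remaining.drop(picked.index)

print(len(remaining))  # 6
```

Each round removes exactly one row per label, so with four rows per label at most four rounds are possible before a group runs empty.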

How can I print out just the index of a pandas dataframe?

I'm trying to make a table, and the way Pandas formats its indices is exactly what I'm looking for. That said, I don't want the actual data, and I can't figure out how to get Pandas to print out just the indices without the corresponding data.
You can access the index attribute of a df using .index:
In [277]:
df = pd.DataFrame({'a':np.arange(10), 'b':np.random.randn(10)})
df
Out[277]:
a b
0 0 0.293422
1 1 -1.631018
2 2 0.065344
3 3 -0.417926
4 4 1.925325
5 5 0.167545
6 6 -0.988941
7 7 -0.277446
8 8 1.426912
9 9 -0.114189
In [278]:
df.index
Out[278]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
.index.tolist() is another option; it returns the index as a list:
In [1391]: datasheet.head(20).index.tolist()
Out[1391]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
You can access individual index labels using df.index[i]:
>> import pandas as pd
>> import numpy as np
>> df = pd.DataFrame({'a':np.arange(5), 'b':np.random.randn(5)})
a b
0 0 1.088998
1 1 -1.381735
2 2 0.035058
3 3 -2.273023
4 4 1.345342
>> df.index[1] ## Second index
>> df.index[-1] ## Last index
>> for i in range(len(df)): print(df.index[i]) ## Using loop
...
0
1
2
3
4
You can use a lambda function, although list(df.index) gives the same result:
index = list(map(lambda x: x, df.index))
print(index)
You can always try df.index. For a default index, this attribute shows the RangeIndex.
Or you can set your own index. Let's say you have a weather.csv file with the headers
'date', 'temperature' and 'event', and you want to set "date" as your index.
import pandas as pd
df = pd.read_csv('weather.csv')
df.set_index('date', inplace=True)
df
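The same idea can be tried without a file on disk. A small sketch, assuming made-up sample rows in place of the real weather file (the values below are invented purely for illustration) and using the index_col parameter of read_csv to set the index while reading:

```python
import io
import pandas as pd

# Made-up sample rows standing in for the weather file.
csv_text = """date,temperature,event
2017-01-01,32,Rain
2017-01-02,35,Sunny
2017-01-03,28,Snow
"""

# index_col sets the index while reading, so no separate set_index call is needed.
df = pd.read_csv(io.StringIO(csv_text), index_col='date')

# Only the index, without the data columns.
index_only = df.index.tolist()
print(index_only)  # ['2017-01-01', '2017-01-02', '2017-01-03']
```

The dates stay plain strings here because parse_dates is not requested.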
