I'm working on a project to read in a text file of variable length which will be generated by a user. There are several comments at the beginning of the text file, one of which needs to be used as the column name. I know it is possible to do this with genfromtxt(), but I am required to use pandas. Here is the beginning of a sample text file:
#GeneratedFile
#This file will be generated by a user
#a b c d f g h i j k l m n p q r s t v w x y z
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
I need #a, b, c,... to be the column names. I tried the following lines of code to read in the data and change it to an array, but it returned only rows and ignored the column names.
import pandas as pd
data = pd.read_table('example.txt',header=2)
d = pd.DataFrame.as_matrix(data)
Is there a way to do this without using genfromtxt()?
One way is the following:
df = pd.read_csv('example.txt', sep=r'\s+', engine='python', header=2)
# the first column name becomes '#a', so rename it
df.rename(columns={'#a':'a'}, inplace=True)
# alternatively, strip '#' from all the column names
#df.columns = [column_name.replace('#', '') for column_name in df.columns]
print(df)
Result:
a b c d f g h i j k ... p q r s t v w x y z
0 0 1 2 3 4 5 6 7 8 9 ... 13 14 15 16 17 18 19 20 21 22
1 1 2 3 4 5 6 7 8 9 10 ... 14 15 16 17 18 19 20 21 22 23
[2 rows x 23 columns]
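A side note on the array conversion in the question: DataFrame.as_matrix() was deprecated in pandas 0.23 and removed in 1.0, so the supported way to get a NumPy array from the parsed frame is:
d = df.to_numpy()  # replacement for the removed as_matrix()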
I have a dataframe with stores and their invoice numbers, and I need to find the missing consecutive invoice numbers per store. For example:
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C','D','D']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203','204','206']
Store Invoice
0 A 1
1 A 2
2 A 5
3 A 6
4 A 8
5 B 20
6 B 23
7 B 24
8 B 30
9 C 200
10 C 202
11 C 203
12 D 204
13 D 206
And I want a dataframe like this:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
Thanks in advance!
You can use groupby.apply to compute a set difference with the range from the min to max value. Then explode:
(df1.astype({'Invoice': int})
    .groupby('Store')['Invoice']
    .apply(lambda s: set(range(s.min(), s.max())).difference(s))
    .explode()
    .reset_index()
)
NB: to guarantee sorted values, use lambda s: sorted(set(range(s.min(), s.max())).difference(s)).
Output:
Store Invoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
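Note that after explode the column is still named Invoice; to get the MissInvoice header from the desired output, assign the chain above to a variable (out is my placeholder name here) and rename:
out = out.rename(columns={'Invoice': 'MissInvoice'})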
Here's an approach:
import pandas as pd
import numpy as np

df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203']
df1['Invoice'] = df1['Invoice'].astype(int)

df2 = df1.groupby('Store')['Invoice'].agg(['min','max'])
# placeholder column; each entry is replaced in the loop below
df2['MissInvoice'] = [[]]*len(df2)
for store, row in df2.iterrows():
    df2.at[store,'MissInvoice'] = np.setdiff1d(np.arange(row['min'], row['max']+1),
                                               df1.loc[df1['Store'] == store, 'Invoice'])
df2 = df2.explode('MissInvoice').drop(columns=['min','max']).reset_index()
The resulting dataframe df2:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
Note: Store D is absent from the dataframe in my code because it is omitted from the lines in the question defining df1.
I have the following dataframe with each row containing two values.
print(x)
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17
6 16 18
7 16 19
8 17 18
9 17 19
10 18 19
11 20 21
I want to merge these values if one or both values of a particular row reoccur in another row. The principle is: if A and B are together in one row and B and C are together in another row, then A, B and C should be together. What I want as an outcome, looking at the dataframe above, is:
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
I tried creating a loop with df.duplicated that would create such an outcome, but it hasn't worked out yet.
This seems like a graph-theory problem dealing with connected components. You can use the networkx library:
import networkx as nx

# assumes the two columns are named 'a' and 'b'
g = nx.from_pandas_edgelist(df, 'a', 'b')

# sort each component so the smallest value lands in the first column
# (sets have no defined order)
pd.concat([pd.Series([comp[0],
                      ' '.join(map(str, comp[1:]))],
                     index=['a', 'b'])
           for comp in (sorted(c) for c in nx.connected_components(g))],
          axis=1).T
Output:
a b
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
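If pulling in networkx is not an option, the same connected-components idea can be done by hand. Here is my own dependency-free sketch using union-find, again assuming the two columns are named 'a' and 'b':
from collections import defaultdict

parent = {}

def find(x):
    # walk up to the root, halving the path as we go
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for a, b in zip(df['a'], df['b']):
    union(a, b)

# collect the members of each component under its root
groups = defaultdict(list)
for node in parent:
    groups[find(node)].append(node)

print([sorted(g) for g in groups.values()])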
I am using pandas to run a function on each row of a dataframe and then save the result into a new column. The problem I am having is my function returns a tuple. The function returns for example...
(2345,4837)
And I am saving this as a new column by doing...
myDataFrame['col5'] = myDataFrame.apply(muFunction, axis=1)
This works, but how do I split the return into 2 columns? Something like...
myDataFrame['col5'] = myDataFrame.apply(muFunction, axis=1)
myDataFrame['col6'] = myDataFrame.apply(muFunction, axis=1)
But the first part of the tuple in col5 and the second in col6, anyone have an example?
Assume that the source DataFrame contains:
A B C
0 2 4 6
1 4 8 12
2 5 10 15
3 8 16 24
4 9 18 27
The function to apply to it, returning a 2-tuple, is:
def myFun(row):
    return row.C + 2, row.C * 2
To apply it and save its result in 2 new columns, you can run:
df[['X', 'Y']] = df.apply(myFun, axis=1).apply(pd.Series)
The result is:
A B C X Y
0 2 4 6 8 12
1 4 8 12 14 24
2 5 10 15 17 30
3 8 16 24 26 48
4 9 18 27 29 54
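.apply(pd.Series) builds a new Series for every row, which can be slow on large frames. Two equivalent spellings that avoid it (shown here as alternatives, not the answer's method):
df[['X', 'Y']] = df.apply(myFun, axis=1, result_type='expand')
# or unpack the Series of tuples directly
df['X'], df['Y'] = zip(*df.apply(myFun, axis=1))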
Assume I have a data frame like:
import pandas as pd
df = pd.DataFrame({"user_id": [1, 5, 11],
"user_type": ["I", "I", "II"],
"joined_for": [1.4, 9.4, 18.1]})
Now I'd like to:
Take each user's joined_for and get the ceiling integer.
Based on the integer, create a new data frame containing number sequences where the maximum is the ceiling number.
This is how I do it now:
import math

new_df = pd.DataFrame()
for i in range(df.shape[0]):
    ceil_num = math.ceil(df.iloc[i]["joined_for"])
    new_df = new_df.append(pd.DataFrame({"user_id": df.iloc[i]["user_id"],
                                         "joined_month": range(1, ceil_num+1)}),
                           ignore_index=True)
new_df = new_df.merge(df.drop(columns="joined_for"), on="user_id")
new_df is what I want, but it's very time-consuming when there are lots of users and the joined_for values get large. Is there a better way to do this, faster or neater?
Using a comprehension:
pd.DataFrame([
[t.user_id, m, t.user_type] for t in df.itertuples(index=False)
for m in range(1, math.ceil(t.joined_for) + 1)
], columns=['user_id', 'joined_month', 'user_type'])
user_id joined_month user_type
0 1 1 I
1 1 2 I
2 5 1 I
3 5 2 I
4 5 3 I
5 5 4 I
6 5 5 I
7 5 6 I
8 5 7 I
9 5 8 I
10 5 9 I
11 5 10 I
12 11 1 II
13 11 2 II
14 11 3 II
15 11 4 II
16 11 5 II
17 11 6 II
18 11 7 II
19 11 8 II
20 11 9 II
21 11 10 II
22 11 11 II
23 11 12 II
24 11 13 II
25 11 14 II
26 11 15 II
27 11 16 II
28 11 17 II
29 11 18 II
30 11 19 II
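For larger inputs you can avoid the per-row loop entirely. A vectorized sketch (my own variant, using index.repeat plus cumcount; it also sidesteps DataFrame.append, which was removed in pandas 2.0):
import numpy as np

reps = np.ceil(df['joined_for']).astype(int)
new_df = df.loc[df.index.repeat(reps)].drop(columns='joined_for')
# number the repeated copies of each original row 1..ceil
new_df['joined_month'] = new_df.groupby(level=0).cumcount() + 1
new_df = new_df.reset_index(drop=True)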
I am currently trying to make use of pandas' MultiIndex. I want to group an existing DataFrame-object df_original based on its columns in a smart way, and was therefore thinking of a MultiIndex.
print(df_original)
by_currency by_portfolio A B C
1 AUD a 1 2 3
2 AUD b 4 5 6
3 AUD c 7 8 9
4 AUD d 10 11 12
5 CHF a 13 14 15
6 CHF b 16 17 18
7 CHF c 19 20 21
8 CHF d 22 23 24
Now, what I would like to have is a MultiIndex DataFrame, with A, B and C, and by_portfolio as indices, looking like:
CHF AUD
A a 13 1
b 16 4
c 19 7
d 22 10
B a 14 2
b 17 5
c 20 8
d 23 11
C a 15 3
b 18 6
c 21 9
d 24 12
I have tried turning all the columns in df_original and the sought-after indices into lists and building a new DataFrame from them, but this seems cumbersome, and I can't figure out how to fill in the actual values afterwards.
Perhaps some sort of groupby is better for this purpose? The thing is, I will later need to add this data to another, similar DataFrame, so the result must be in a form that can be combined with one.
Thanks
You can use a combination of stack and unstack:
In [50]: df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
Out[50]:
by_currency AUD CHF
by_portfolio
a A 1 13
B 2 14
C 3 15
b A 4 16
B 5 17
C 6 18
c A 7 19
B 8 20
C 9 21
d A 10 22
B 11 23
C 12 24
To obtain your desired result, we only need to swap the levels of the index and sort:
In [51]: df2 = df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
In [52]: df2.columns.name = None
In [53]: df2.index = df2.index.swaplevel(0,1)
In [55]: df2 = df2.sort_index()
In [56]: df2
Out[56]:
AUD CHF
by_portfolio
A a 1 13
b 4 16
c 7 19
d 10 22
B a 2 14
b 5 17
c 8 20
d 11 23
C a 3 15
b 6 18
c 9 21
d 12 24
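The same steps also read well as a single chain (my condensed version of the code above):
df2 = (df.set_index(['by_currency', 'by_portfolio'])
         .stack()
         .unstack(0)
         .swaplevel(0, 1)   # put A/B/C before by_portfolio
         .sort_index())
df2.columns.name = None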