I want to solve a problem that essentially boils down to this:
I have identifier numbers (thousands of them), and each should be uniquely linked to a set of letters; let's call them a through e. Missing values can be filled from another column (y) if that helps.
Occasionally one of the letters is missing and is registered as NaN. How can I replace those NaNs so that each identifier ends up with all the required letters?
Idnumber X y
1 a a
2 a a
1 b b
1 NaN d
2 b NaN
1 d c
2 c NaN
1 NaN e
2 d d
2 e e
Any given X can be missing.
The dataset is too big to simply add all possibilities and drop duplicates.
The idea is to get:
Idnumber X
1 a
2 a
1 b
1 c
2 b
1 d
2 c
1 e
2 d
2 e
The main issue is getting a unique solution, i.e. making sure that one NaN is replaced by c and the other by e.
Is this what you're looking for? If it uses too much RAM, you can use the chunksize parameter in read_csv: write the results (with duplicates and NaNs dropped) for each individual chunk to CSV, then load those and drop duplicates again, this time dropping the duplicates that conflict across chunks. A sketch of that chunked variant follows the code below.
# Build the sample DataFrame
import pandas as pd
from io import StringIO  # Python 3; `from StringIO import StringIO` is Python 2 only

x = StringIO('''Idnumber,X,y
1,a,a
2,a,a
1,b,b
1,NaN,d
2,b,NaN
1,d,c
2,c,NaN
1,NaN,e
2,d,d
2,e,e''')

# Stack X and y into a single column, then drop NaNs and duplicates
df = pd.read_csv(x)
df1 = df[['Idnumber', 'X']]
df2 = df[['Idnumber', 'y']].rename(columns={'y': 'X'})  # rename on the copy avoids SettingWithCopyWarning
pd.concat([df1, df2]).dropna().drop_duplicates()
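For the chunked variant, here is a minimal sketch (hedged: data.csv, partial.csv, and the chunk size are placeholder assumptions, not from the question):

import pandas as pd

# First pass: dedupe within each chunk and append the survivors to an intermediate file
with open('partial.csv', 'w') as out:
    for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=100_000)):
        part = pd.concat([chunk[['Idnumber', 'X']],
                          chunk[['Idnumber', 'y']].rename(columns={'y': 'X'})])
        part.dropna().drop_duplicates().to_csv(out, index=False, header=(i == 0))

# Second pass: drop the duplicates that conflict across chunks
result = pd.read_csv('partial.csv').drop_duplicates()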
I need a suggestion on a procedure using pandas. I have a two-column dataset that looks like this:
A 0.4533
B 0.2323
A 1.2343
A 1.2353
B 4.3521
C 3.2113
C 2.1233
.. ...
where the first column contains strings and the second one floats. I would like to save the minimum value for each group of unique strings, so that I end up with the minimum associated with A, B, and C. Does anybody have any suggestions? It would also help if I could somehow store all the values associated with each string.
Many thanks,
James
Input data:
>>> df
0 1
0 A 0.4533
1 B 0.2323
2 A 1.2343
3 A 1.2353
4 B 4.3521
5 C 3.2113
6 C 2.1233
Use groupby followed by min:
out = df.groupby(0).min()
Output result:
>>> out
1
0
A 0.4533
B 0.2323
C 2.1233
Update: to filter out all the values in the original dataset that are more than 20% above their group's minimum, use a groupby transform as a boolean mask:
out = df[df.groupby(0)[1].transform(lambda x: x <= x.min() * 1.2)]
>>> out
0 1
0 A 0.4533
1 B 0.2323
6 C 2.1233
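The question also asks about storing all the values associated with each string. One option (a sketch, assuming pandas >= 0.25 for named aggregation) is to collect them into lists alongside the minimum:

summary = df.groupby(0)[1].agg(minimum='min', values=list)
print(summary)
#    minimum                    values
# 0
# A   0.4533  [0.4533, 1.2343, 1.2353]
# B   0.2323          [0.2323, 4.3521]
# C   2.1233          [3.2113, 2.1233]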
You can simply do it by
min_A = df[df["column_1"] == "A"]["value"].min()
min_B = df[df["column_1"] == "B"]["value"].min()
min_C = df[df["column_1"] == "C"]["value"].min()
where df is the DataFrame, and column_1 and value are the names of its columns.
You can also use pandas' built-in groupby() function:
>>> df.groupby(["column_1"]).min()
The above gives the same result.
I searched and couldn't find a problem like mine, so if one exists and I somehow missed it, please let me know and I'll delete this post.
I am stuck on a problem: splitting a pandas DataFrame into different DataFrames by a value.
I have a dataset inside a text file and I store it as a pandas DataFrame that has only one column. There is more than one set of information inside the dataset, and a certain value marks the end of each set; you can see a sample below:
The Sample Input
In [8]: df
Out[8]:
var1
0 a
1 b
2 c
3 d
4 endValue
5 h
6 f
7 b
8 w
9 endValue
So I want to split this df into different DataFrames. I couldn't find a way to do that, but I'm sure there must be an easy one. The format I show in the sample output may well be wrong, so if you have a better idea I'd love to see it. Thank you for your help.
The sample output I'd like
var1
{[0 a
1 b
2 c
3 d
4 endValue]},
{[0 h
1 f
2 b
3 w
4 endValue]}
You could check where var1 is endValue, take the cumsum, and use the result as a custom grouper. Then group by it and build a dictionary from the result:
d = dict(tuple(df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))))
Or for a list of dataframes (effectively indexed in the same way):
l = [v for _,v in df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))]
print(l[0])
var1
0 a
1 b
2 c
3 d
4 endValue
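As a quick usage note for the dictionary variant above: the keys are the shifted cumulative sums (0 for the first set, 1 for the second, and so on; because of the float fill_value they may come out as 0.0, 1.0, depending on your pandas version), so you can iterate or index directly:

# Walk over the sub-frames in order
for key, frame in d.items():
    print(f'set {key}:')
    print(frame)

# Grab the first set without depending on the key's exact dtype
first = d[list(d)[0]]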
Another idea, which assumes unique index values: keep the index only where var1 matches endValue, replace the rest with NaN and backfill, then loop over the groupby object to get a list of DataFrames:
g = df.index.to_series().where(df['var1'].eq('endValue')).bfill()
dfs = [a for i, a in df.groupby(g, sort=False)]
print(dfs)
[ var1
0 a
1 b
2 c
3 d
4 endValue, var1
5 h
6 f
7 b
8 w
9 endValue]
I have a pandas DataFrame df that represents the edges of a directed acyclic graph, sorted by Target:
Source Target
C A
D A
A B
C B
D B
E B
E C
C D
E D
I would like to add a column Weight based on occurrences of values.
Weight should be the number of appearances of the Target value in the Target column divided by the number of appearances of the Source value in the Target column.
In other words, the first row of the example should have a Weight of 2/1 = 2, since A appears twice in Target while C appears only once there.
I have first tried
df.apply(pd.Series.value_counts)
but the problem is that my actual DataFrame is extremely large, so I cannot manually look up each count in the result and compute the quotient. I have also tried writing two new columns that hold the counts I need, planning to compute a final column from them:
df['tfreq'] = df.groupby('Target')['Target'].transform('count')
df['sfreq'] = df.groupby('Source')['Target'].transform('count')
but it seems my second line returns the occurrences of Source values in the Source column instead of in the Target column.
Are there any insights on this problem?
Use value_counts with map. Then divide them:
val_counts = df['Target'].value_counts()
counts1 = df['Target'].map(val_counts)
counts2 = df['Source'].map(val_counts)
df['Weights'] = counts1.div(counts2) # same as counts1 / counts2
Output
Source Target Weights
0 C A 2.0
1 D A 1.0
2 A B 2.0
3 C B 4.0
4 D B 2.0
5 E B NaN
6 E C NaN
7 C D 2.0
8 E D NaN
Note: we get NaN in the rows where the Source value (E) does not occur in the Target column, so the divisor is missing.
I have two datasets (actually many, but let's stick with two) and I need to merge them. However, they do not span the same range and they have different reference values. Let's consider
a 1
b 2
c 3
e 4
and
a 2
b 3
d 7
e 2
I tried to simulate Excel's INDEX and MATCH functions, but I am not able to get the right result:
b = []
f = []
for i in data1["c1"]:
    if i in data2["c1"]:
        a = d3[data2["c4"].index[i]]
        f = b.append(a)
    else:
        continue
print(f)
Can you please help me understand how to make this work? I would also welcome a link with further information about this topic. Thank you.
If you want to create a consolidated file from the two above like:
Col1 Col2 Col3
a 1 2
b 2 3
c 3 7
d 4 2
You can simply use dictionaries, with keys taken from your first-column values (a, b, c, d) and values as lists of the second-column values from your two DataFrames, respectively, like:
your_dict = {'a': [1, 2], 'b': [2, 3], 'c': [3, 7], 'd': [4, 2]}
Then, to turn that into one DataFrame such as the one above, just use the .from_dict() method in pandas with the orient parameter set to 'index' (see the pandas documentation).
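A minimal sketch of that last step (the columns argument to from_dict needs pandas >= 0.23; the column names here are just the hypothetical ones from the table above):

import pandas as pd

your_dict = {'a': [1, 2], 'b': [2, 3], 'c': [3, 7], 'd': [4, 2]}

# orient='index' turns each key into a row; name the value columns explicitly
out = pd.DataFrame.from_dict(your_dict, orient='index', columns=['Col2', 'Col3'])
out = out.rename_axis('Col1').reset_index()
print(out)
#   Col1  Col2  Col3
# 0    a     1     2
# 1    b     2     3
# 2    c     3     7
# 3    d     4     2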
I have an almost embarrassingly simple question, which I cannot figure out for myself.
Here's a toy example to demonstrate what I want to do, suppose I have this simple data frame:
df = pd.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12]],index=range(2),columns=list('abcdef'))
a b c d e f
0 1 2 3 4 5 6
1 7 8 9 10 11 12
What I want is to stack it so that it takes the following form, where the column identifiers have been changed (to X and Y) so that they are the same for all re-stacked values:
X Y
0 1 2
3 4
5 6
1 7 8
9 10
11 12
I am pretty sure you can do it with pd.stack() or pd.pivot_table(), but even after reading the documentation I cannot figure out how. Instead of appending all columns to the end of the next, I just want to append pairs (or, actually, triplets) of values from each row.
Just to add some more flesh to the bones of what I want to do:
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
a b c d e f
0 -0.168636 -1.878447 -0.985152 -0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890 -1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250 -1.718324 0.145479 -0.099530
I want this re-stacked into the following form (where the column labels have again been changed so they are the same for all values):
X Y Z
0 -0.168636 -1.878447 -0.985152
-0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890
-1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250
-1.718324 0.145479 -0.099530
Yes, one could just write a for-loop that applies the following logic to each row:
row.reshape(-1, 2)
But then you would have to process each row individually, and my actual data has tens of thousands of rows.
So I want to stack each individual row selectively (e.g. by pairs or triplets of values) and then stack those row-stacks for the entire data frame, preferably in one operation on the whole frame (if possible).
Apologies for such a trivial question.
Use numpy.reshape to reshape the underlying data in the DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
print(df)
# a b c d e f
# 0 -0.889810 1.348811 -1.071198 0.091841 -0.781704 -1.672864
# 1 0.398858 0.004976 1.280942 1.185749 1.260551 0.858973
# 2 1.279742 0.946470 -1.122450 -0.355737 1.457966 0.034319
result = pd.DataFrame(df.values.reshape(-1, 3),
                      index=df.index.repeat(2), columns=list('XYZ'))
print(result)
yields
X Y Z
0 -0.889810 1.348811 -1.071198
0 0.091841 -0.781704 -1.672864
1 0.398858 0.004976 1.280942
1 1.185749 1.260551 0.858973
2 1.279742 0.946470 -1.122450
2 -0.355737 1.457966 0.034319
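For the pairs case from the first example, the same idea applies with a different reshape and repeat count; a sketch using the integer toy frame from the question:

df = pd.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12]], index=range(2), columns=list('abcdef'))
pairs = pd.DataFrame(df.values.reshape(-1, 2),
                     index=df.index.repeat(3), columns=list('XY'))
print(pairs)
#     X   Y
# 0   1   2
# 0   3   4
# 0   5   6
# 1   7   8
# 1   9  10
# 1  11  12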