Match randomly selected columns in two different csv files - python

I want to select 20 random values from the ID column of the first CSV, look each one up in the ID column of the second CSV, and then write the matching values from the two files next to each other.
First CSV          Second CSV
ID   ...           ID   ...
123                456
456                865
765                876
865                900
456                123
...                ...
876                765
Output (should contain only 20 IDs):
ID_1   ID_2
123    123
456    456
765    765
865    865
...    ...
876    876
I tried the scripts below and obtained the expected output, but I could not be sure whether the code was working correctly. Is it really pulling values from two separate CSV files? I would also like to simplify it.
code
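A minimal sketch of one way to do this with pandas (the file names are hypothetical; both files are assumed to have an ID column as in the example above):

import pandas as pd

# hypothetical file names; both files are assumed to have an 'ID' column
first = pd.read_csv('first.csv')
second = pd.read_csv('second.csv')

# draw 20 random IDs from the first file
sample_ids = first['ID'].sample(n=20, random_state=1)

# keep only the sampled IDs that also appear in the second file
matched = sample_ids[sample_ids.isin(second['ID'])]

# write the matching values from the first and second file side by side
out = pd.DataFrame({'ID_1': matched.values, 'ID_2': matched.values})
out.to_csv('output.csv', index=False)

Since the match is on equality, ID_1 and ID_2 are necessarily identical; the two columns just confirm that each sampled ID from the first file was found in the second.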

Related

Pandas filtering based on minimum data occurrences across multiple columns

I have a dataframe like this:
country  data_fingerprint  organization
US       111               Tesco
UK       222               IBM
US       111               Yahoo
PY       333               Tesco
US       111               Boeing
CN       333               TCS
NE       458               Yahoo
UK       678               Tesco
I want the data_fingerprint values for rows where both the organization and the country are among the top 2 by count. Looking at organization, the top 2 by occurrences are Tesco and Yahoo; for country they are US and UK.
So based on that, the output for data_fingerprint should be:
data_fingerprint
111
678
What I have tried, to check whether the organization exists in the complete dataframe and filter those rows, is this:
# first find the top 2 occurrences of organization
nd = df['organization'].value_counts().groupby(level=0, group_keys=False).head(2)
# then check whether the organization exists in the complete dataframe and filter those rows
new = df["organization"].isin(nd)
But I am not getting any data here. Once I get this working, I can do the same for country. Can someone please help me get the output? I have little data, so I am using Pandas.
Here is one way to do it:
df[
    df['organization'].isin(df['organization'].value_counts().head(2).index) &
    df['country'].isin(df['country'].value_counts().head(2).index)
]['data_fingerprint'].unique()

array([111, 678], dtype=int64)
Annotated code:
# find the top 2 most frequent country and organization values
i1 = df['country'].value_counts().index[:2]
i2 = df['organization'].value_counts().index[:2]
# create a boolean mask selecting rows whose country and organization are both in the top 2
mask = df['country'].isin(i1) & df['organization'].isin(i2)
# filter the rows with the mask and drop duplicates in data_fingerprint
df.loc[mask, ['data_fingerprint']].drop_duplicates()
Result:
   data_fingerprint
0               111
7               678
You can do:
# first find the top 2 occurrences of organization
nd = df['organization'].value_counts().head(2).index
# then check whether the organization exists in the complete dataframe and filter those rows
new = df["organization"].isin(nd)
Note the .index: your original attempt passed the value_counts() Series itself to isin, which compares against the counts rather than the organization names, so nothing matched. Output - only Tesco and Yahoo are left:
df[new]
  country  data_fingerprint organization
0      US               111        Tesco
2      US               111        Yahoo
3      PY               333        Tesco
6      NE               458        Yahoo
7      UK               678        Tesco
You can do the same for country
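For completeness, a sketch of the same idea applied to country and combined with the organization mask (reusing the new variable from the snippet above):

# top 2 countries, then combine both conditions and pull the distinct fingerprints
top_countries = df['country'].value_counts().head(2).index
mask = new & df['country'].isin(top_countries)
df.loc[mask, 'data_fingerprint'].drop_duplicates()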

Is there a way to find the number of occurrences of each value in a column in another column?

I have two dataframes called dataset1 and dataset2 (shown below). The "pattern" and "SAX" columns contain string values.
dataset1 =
     pattern  tstamps
0    glngsyu  1610460
1    zicobgm  1610466
2    eerptow      ...
3    cqbsynt      ...
4    zvmqben      ...
..       ...      ...
475  rfikekw
476  bnbzvqx
477  rsuhgax
478  ckhloio
479  lbzujtw
480 rows × 2 columns
dataset2 =
            SAX  timestamp
0       hssrlcu      16015
1       ktyuymp      16016
2       xncqmfr      16017
3       aanlmna      16018
4       urvahvo      16019
...         ...        ...
263455  jeivqzo     279470
263456  bzasxgw     279471
263457  jspqnqv     279472
263458  sxwfchj     279473
263459  gxqnhfr     279474
263460 rows × 2 columns
Is there a way to check the occurrence count of each row of pattern (dataset1) in SAX (dataset2)? Basically, the number of times a value in the pattern column of dataset1 appears in the SAX column of dataset2. Something like this:
dataset1 =
     pattern  no. of occurrences
0    glngsyu                   3
1    zicobgm                   0
2    eerptow                   1
..       ...                 ...
479  lbzujtw                   2
480 rows × 2 columns
Thanks.
This should do it. Using map with fillna means patterns that never appear in SAX get a count of 0 instead of raising a KeyError (which a plain .loc lookup would do):
dataset2_SAX_value_counts = dataset2["SAX"].value_counts()
dataset1["no. of occurrences"] = dataset1["pattern"].map(dataset2_SAX_value_counts).fillna(0).astype(int)

Extracting a sub data frame from original data frame by specifying the multi index

First, I made the following data frame, which has two indices, file_id and obj_id, so the two leftmost fields are part of the index, not data columns.
file_id   obj_id  val_1  val_2  ...
'file_1'  0       111    222
          2       111    222
          4       413    1231
'file_2'  5       111    222
          27      111    222
          3       413    1231
          9       413    1231
'file_3'  0       111    222
          2       111    222
          4       413    1231
...
I want to extract the rows for several file_id values and create a new data frame that keeps the original structure. For example, given the list ['file_1', 'file_3'], the desired output is:
file_id   obj_id  val_1  val_2  ...
'file_1'  0       111    222
          2       111    222
          4       413    1231
'file_3'  0       111    222
          2       111    222
          4       413    1231
I first tried to drop all the unnecessary rows, but that requires specifying the second index obj_id as well:
df.drop(['file_2', 'file_4', 'file_5' ...])  # throws an exception
df.drop(('file_2', 5))  # works, but leaves the other rows of file_2; I need to drop all rows of file_2
file_id   obj_id  val_1  val_2  ...
'file_2'  27      111    222
          3       413    1231
          9       413    1231
...
Since obj_id depends on each file_id, this dropping approach does not work unless there is something like a wildcard. Dropping also takes many steps; I wish I could simply extract the rows by a list of file_id values. Is there any solution?
Just posting my comment as an answer so OP can mark this as answered:
Since file_id is the outermost level of your MultiIndex, you should just need to use .loc:
df.loc[["file_1", "file_3"], :]

How to combine CSVs in Python?

So I have two CSV files, and wish to merge them to create one file. My first CSV file has names and balances from 2018, while the second one has names and balances from 2019.
For example:
2018
ABC  123
XYZ  456
2019
ABC  123
PQR  234
Final output should look like:
ABC  123  123
XYZ  456    0
PQR    0  234
I just don't understand how to do this with Pandas. I am new to Python, and this assignment was given this morning. It is something that would work as a FULL OUTER JOIN if I were working in SQL, but I have no clue how to implement it in Python.
table: test2018.csv
Name  T1
ABC   123
XYZ   456
table: test2019.csv
Name  T1
ABC   123
PQR   234
import pandas as pd

train = pd.read_csv('test2018.csv')
train2 = pd.read_csv('test2019.csv')
train.head()
train2.head()

# an outer merge keeps names that appear in only one file (SQL FULL OUTER JOIN)
t1 = pd.merge(train, train2, on='Name', how='outer')
# names missing from one year come back as NaN; fill with 0 to match the desired output
t1 = t1.fillna(0)
print(t1)
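With the sample files above, this prints something like:

  Name   T1_x   T1_y
0  ABC  123.0  123.0
1  XYZ  456.0    0.0
2  PQR    0.0  234.0

The _x/_y suffixes are pandas' defaults for overlapping column names; passing suffixes=('_2018', '_2019') to merge would give clearer labels.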

Adding column from dataframe with different structure

I have the following two dataframe structures:
              roc_100
                  max             min
industry        Banks  Health  Banks  Health
date
2015-03-15       3456     456    345     567
2015-03-16       6576     565    435     677
2015-03-17       5478     657    245     123
and:
            roc_100
                max    min
date
2015-03-15      546   7856
2015-03-16      677    456
2015-03-17     3546    346
As can be seen, the difference between the two dataframes is that the bottom one has no 'industry' level; the rest of the structure is the same, i.e. dates down the left and columns grouped under roc_100 with max and min beneath it.
What I need to do is add the columns from the bottom dataframe to the top one and give the added columns an industry name, e.g. 'Benchmark'. The resulting dataframe should be as follows:
              roc_100
                  max                        min
industry        Banks  Health  Benchmark  Banks  Health  Benchmark
date
2015-03-15       3456     456        546    345     567       7856
2015-03-16       6576     565        677    435     677        456
2015-03-17       5478     657       3546    245     123        346
I have tried using append and join, but neither option has worked so far, because one dataframe has an 'industry' level and the other doesn't.
Edit:
I have managed to merge them correctly using:
industry_df = industry_df.merge(benchmark_df, how='inner', left_index=True, right_index=True)
The only problem now is that the newly added columns still don't have an 'industry'. This means that if I want just one industry, e.g. Health, I can do:
print(industry_df['roc_100', 'max', 'Health'])
That works, but if I want to print all the industries including the newly added columns, I can't. If I try:
print(industry_df['roc_100', 'max'])
this only prints the newly added columns, because they are the only ones without an 'industry'. Is there a way to give these newly merged columns an industry name?
You can use stack() and unstack() to bring the two dataframes to identical index structures, with industries as columns. Then assign the new benchmark columns. As a last step, restore the initial index/column structure with the same stack()/unstack().
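A minimal sketch of that approach (assuming the industry column level is named 'industry' and the index is named 'date', as displayed above; 'Benchmark' is the label being added):

import pandas as pd

# 1. move the industry level out of the columns and into the row index,
#    so both frames share the same two-level column structure
stacked = industry_df.stack(level='industry')   # index: (date, industry)

# 2. tag the benchmark rows with the new industry name by adding an index level
bench = pd.concat({'Benchmark': benchmark_df}, names=['industry'])
bench = bench.swaplevel('industry', 'date')     # match the (date, industry) order

# 3. combine and restore the original column structure
result = pd.concat([stacked, bench]).unstack(level='industry')
result = result.sort_index(axis=1)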
