So I have two CSV files, and wish to merge them to create one file. My first CSV file has names and balances from 2018, while the second one has names and balances from 2019.
For example:
2018
ABC 123
XYZ 456
2019
ABC 123
PQR 234
The final output should look like this:
ABC 123 123
XYZ 456 0
PQR 0 234
I just don't understand how to do this with pandas. I am new to Python, and this assignment was given to me this morning. In SQL this would be a FULL OUTER JOIN, but I have no clue how to implement it in Python.
Table: test2018.csv
Name T1
ABC 123
XYZ 456
Table: test2019.csv
Name T1
ABC 123
PQR 234
import pandas as pd

train = pd.read_csv('test2018.csv')
train2 = pd.read_csv('test2019.csv')
train.head()
train2.head()

t1 = pd.merge(train, train2, on='Name', how='outer')
print(t1)
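Note that the outer merge alone leaves NaN where a name appears in only one year's file. A small follow-up (a sketch, assuming the balance column is named T1 in both CSVs) labels the two balance columns by year and fills the gaps with 0 to match the expected output:

# suffixes label the 2018 and 2019 balance columns after the merge;
# fillna(0) replaces the NaN left for names that appear in only one file
t1 = pd.merge(train, train2, on='Name', how='outer', suffixes=('_2018', '_2019'))
t1 = t1.fillna(0)
print(t1)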
I have a dataframe like this
country data_fingerprint organization
US 111 Tesco
UK 222 IBM
US 111 Yahoo
PY 333 Tesco
US 111 Boeing
CN 333 TCS
NE 458 Yahoo
UK 678 Tesco
I want the data_fingerprint values for rows where both the organization and the country are among the top 2 by count.
Looking at the data, the top 2 organizations are Tesco and Yahoo, and the top 2 countries are US and UK.
Based on that, the output for data_fingerprint should be:
data_fingerprint
111
678
What I have tried, to check whether the organization exists in the complete dataframe, is this:
# First find top 2 occurrences of organization
nd = df['organization'].value_counts().groupby(level=0, group_keys=False).head(2)
# Then check whether the organization exists in the complete dataframe and filter those rows
new = df["organization"].isin(nd)
But I am not getting any data here. Once I get this working for organization, I can do the same for country.
Can someone please help me get the output? The data is small, so I am using pandas.
Here is one way to do it:
df[
    df['organization'].isin(df['organization'].value_counts().head(2).index) &
    df['country'].isin(df['country'].value_counts().head(2).index)
]['data_fingerprint'].unique()
array([111, 678], dtype=int64)
Annotated code
# find top 2 most occurring country and organization
i1 = df['country'].value_counts().index[:2]
i2 = df['organization'].value_counts().index[:2]
# Create boolean mask to select the rows having top 2 country and org.
mask = df['country'].isin(i1) & df['organization'].isin(i2)
# filter the rows using the mask and drop dupes in data_fingerprint
df.loc[mask, ['data_fingerprint']].drop_duplicates()
Result
data_fingerprint
0 111
7 678
You can do
# First find top 2 occurrences of organization
nd = df['organization'].value_counts().head(2).index
# Then check whether the organization exists in the complete dataframe and filter those rows
new = df["organization"].isin(nd)
Output - Only Tesco and Yahoo left
df[new]
country data_fingerprint organization
0 US 111 Tesco
2 US 111 Yahoo
3 PY 333 Tesco
6 NE 458 Yahoo
7 UK 678 Tesco
You can do the same for country
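For completeness, here is the same pattern applied to country, plus the two masks combined (a brief sketch reusing the new mask from above):

# Top 2 occurring countries
nc = df['country'].value_counts().head(2).index
new_country = df['country'].isin(nc)

# Rows where both masks hold, then the unique fingerprints
df.loc[new & new_country, 'data_fingerprint'].unique()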
I have a df which looks like this :
CustomerID  CustomerName  StoreName
101         Mike          ABC
102         Sarah         ABC
103         Alice         ABC
104         Michael       PQR
105         Abhi          PQR
106         Bill          XYZ
107         Roody         XYZ
Now I want to separate out the 3 stores into 3 separate DataFrames.
For this I created a list of store names:
store_list = df.select("StoreName").distinct().rdd.flatMap(lambda x:x).collect()
Now I want to iterate through this list and filter each store into its own DataFrame.
for i in store_list:
    df_{i} = df.where(col("storeName") == i)
The code obviously has syntax errors, but that's the approach I am thinking of. I want to avoid pandas because the datasets are huge.
Can anyone help me with this?
Thanks
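One way to do this (a sketch, not from the original thread; it assumes the column is named StoreName as in the table above) is to collect the filtered DataFrames in a dictionary keyed by store name instead of trying to create dynamically named variables:

from pyspark.sql.functions import col

# distinct store names, collected to the driver as plain strings
store_list = [row["StoreName"] for row in df.select("StoreName").distinct().collect()]

# one filtered DataFrame per store, kept in a dictionary
store_dfs = {store: df.where(col("StoreName") == store) for store in store_list}

# e.g. the DataFrame for store "ABC"
store_dfs["ABC"].show()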
I have an Excel document that contains a sports column holding both sport names and players' names. If I click on a sport name, the players' names collapse, i.e. the players' names are children of the sport name.
Please look at the data below:
If I click on cricket, then the names ramesh, suresh and mahesh disappear, i.e. cricket is the parent of ramesh, suresh and mahesh; similarly, football is the parent of pankaj, riyansh and suraj.
I want to read this Excel document and convert it into a pandas DataFrame. I tried to read it with pandas pivot_table but had no success.
I tried to read the Excel sheet and convert it into a DataFrame:
df = pd.read_excel("sports.xlsx",skiprows=7,header=0)
d = pd.pivot_table(df,index=["sports"])
print d
But I'm getting all the sports values in a single column. I want to split them by sport name and the corresponding players' names.
Expected Output:
sports_name player_name age address
cricket ramesh 20 aaa
cricket suresh 21 bbb
cricket mahesh 22 ccc
football pankaj 24 eee
football riyansh 25 fff
football suraj 26 ggg
basketball rajesh 28 iii
basketball abhijeet 29 jjj
pandas.pivot_table is there to support data analysis and helps you create pivot tables similar to Excel's; it is not meant to read Excel pivot tables.
Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame
Example from Documentation
>>> df
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
>>> table = pivot_table(df, values='D', index=['A', 'B'],
... columns=['C'], aggfunc=np.sum)
>>> table
small large
foo one 1 4
two 6 NaN
bar one 5 4
two 6 7
Now, to help with your problem, I created a sample data set and a pivot table.
Then I read the Excel sheet into a pandas DataFrame. This DataFrame contains NaNs, which can be filled using df.fillna(method='ffill'):
df = pd.read_excel(pivotfile, skiprows=12, header=0)
df = df.fillna(method='ffill')
print(df)
output
Sports Name Address Age
0 basketball Abhijit 129 ABC 20
1 basketball Rajesh 128 ABC 20
2 Cricket Mahesh 123 ABC 20
3 Cricket Ramesh 126 ABC 20
4 Cricket Suresh 124 ABC 20
5 Football Riyash 125 ABC 20
6 Football suraj 127 ABC 20
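If you also want the column names from the expected output above, the same DataFrame can be renamed afterwards (a small sketch that simply relabels the columns shown in this output):

# map the sheet's column names onto the names from the expected output
df = df.rename(columns={'Sports': 'sports_name', 'Name': 'player_name',
                        'Age': 'age', 'Address': 'address'})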
I have a file with the following structure (there are around 10K rows):
User Destination Country
123 34578 US
123 34578 US
345 76590 US
123 87640 MX
890 11111 CA
890 88888 CA
890 99999 CA
Each user can go to multiple destinations located in different countries. I need to find the number of unique destinations users go to, plus the median and mean of those unique-destination counts, and the same for countries. I don't know how to use groupby to achieve that. I managed to get the stats by placing everything in a nested dictionary, but I feel there may be a much easier approach using pandas DataFrames and groupby.
I am not looking for a count on each groupby section. I am looking for something like: on average, users visit X destinations and Y countries. So, I am looking for aggregate stats over all groupby results.
Edit. Here is my dict approach:
from collections import defaultdict

test = lambda: defaultdict(test)
conn_l = test()

with open('myfile') as f:
    for line in f:
        current = line.split(' ')
        s = current[0]
        d = current[1]
        if conn_l[s][d]:
            conn_l[s][d] += 1
        else:
            conn_l[s][d] = 1

lengths = []
for k, v in conn_l.items():
    lengths.append(len(v))
I think this one might be a little harder than it looks at first glance (or perhaps there is a simpler approach than what I do below).
ser = df.groupby('User')['Destination'].value_counts()
123 34578 2
87640 1
345 76590 1
890 11111 1
99999 1
88888 1
The output of value_counts() is a Series, so you can groupby a second time to get a count of the unique destinations.
ser2 = ser.groupby(level=0).count()
User
123 2
345 1
890 3
That's for clarity but you could do it all on one line.
df.groupby('User')['Destination'].value_counts().groupby(level=0).count()
With ser2 you ought to be able to do all the other things.
ser2.median()
ser2.mean()
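A more direct alternative (a sketch, not part of the original answer) is nunique, which gives the number of unique destinations and countries per user in one step; the aggregate stats then follow directly:

uniq = df.groupby('User')[['Destination', 'Country']].nunique()

# e.g. on average, how many unique destinations and countries each user visits
print(uniq['Destination'].mean(), uniq['Destination'].median())
print(uniq['Country'].mean(), uniq['Country'].median())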
Agree with JohnE that counting the number of entries for User is not obvious.
I found that:
df2 = df.groupby(['User','Destination'])
df3 = df2.size().groupby(level=0).count()
also works, the only difference being that df2 is a DataFrameGroupBy rather than a SeriesGroupBy, so it potentially has a bit more functionality since it retains the Country information.
A trivial example:
for name, group in df2:
    print(name, group)
(123, 34578) User Destination Country
0 123 34578 US
1 123 34578 US
(123, 87640) User Destination Country
3 123 87640 MX
(345, 76590) User Destination Country
2 345 76590 US
(890, 11111) User Destination Country
4 890 11111 CA
(890, 88888) User Destination Country
5 890 88888 CA
(890, 99999) User Destination Country
6 890 99999 CA
ser = df.groupby('User')['Destination']
for name, group in ser:
    print(name, group)
123 0 34578
1 34578
3 87640
Name: Destination, dtype: int64
345 2 76590
Name: Destination, dtype: int64
890 4 11111
5 88888
6 99999
Name: Destination, dtype: int64
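As with ser2 above, the aggregate stats then come straight from df3 (a brief sketch using the names from this answer):

# mean and median number of unique destinations per user
print(df3.mean())    # 2.0 on the sample data
print(df3.median())  # 2.0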
Given the following DataFrame, how can I filter groups based on whether a value is in the group?
For example, in this table I would like to retain the groups which contain "TC" in the Dept column:
Job Dept
123 TC
123 TC
123 TC
123 FB
123 FB
123 MD
456 FB
456 FB
456 FB
456 FB
I would like the output to be a table or DataFrame like this:
Job Dept
123 TC
123 TC
123 TC
123 FB
123 FB
123 MD
I know I can check whether "TC" is in the column by using
df['Dept'].isin(["TC"]).any()
I don't know how to use apply, or whatever else, to figure this out by group and return a dataframe of only those groups.
I just figured out the answer. I was looking at apply, but I needed to use filter:
df.groupby("Job").filter(lambda x : x["Dept"].isin(["TC"]).any())
You can use boolean indexing:
df[df['Dept'] == 'FB']
http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing