What Factors Affect Graduate Admissions in America for Students?

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.
import seaborn as sns
import matplotlib.pyplot as plt
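# importing plotly
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import cufflinks as cf

# enable offline plotting so that DataFrame.iplot() renders inside the notebook
init_notebook_mode(connected=True)
cf.go_offline()

# reading dataset (file name taken from the input directory listing)
df = pd.read_csv("../input/Admission_Predict_Ver1.1.csv")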
['Admission_Predict_Ver1.1.csv', 'binary.csv']

Data statistics and a sneak peek into the data:

In [2]:
#General data statistics
   Serial No.  GRE Score  TOEFL Score  University Rating  SOP  LOR  CGPA  Research  Chance of Admit
0           1        337          118                  4  4.5  4.5  9.65         1             0.92
1           2        324          107                  4  4.0  4.5  8.87         1             0.76
2           3        316          104                  3  3.0  3.5  8.00         1             0.72
3           4        322          110                  3  3.5  2.5  8.67         1             0.80
4           5        314          103                  2  2.0  3.0  8.21         0             0.65
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
Serial No.           500 non-null int64
GRE Score            500 non-null int64
TOEFL Score          500 non-null int64
University Rating    500 non-null int64
SOP                  500 non-null float64
LOR                  500 non-null float64
CGPA                 500 non-null float64
Research             500 non-null int64
Chance of Admit      500 non-null float64
dtypes: float64(4), int64(5)
memory usage: 35.2 KB
       Serial No.   GRE Score  TOEFL Score  University Rating         SOP        LOR        CGPA    Research  Chance of Admit
count  500.000000  500.000000   500.000000         500.000000  500.000000  500.00000  500.000000  500.000000        500.00000
mean   250.500000  316.472000   107.192000           3.114000    3.374000    3.48400    8.576440    0.560000          0.72174
std    144.481833   11.295148     6.081868           1.143512    0.991004    0.92545    0.604813    0.496884          0.14114
min      1.000000  290.000000    92.000000           1.000000    1.000000    1.00000    6.800000    0.000000          0.34000
25%    125.750000  308.000000   103.000000           2.000000    2.500000    3.00000    8.127500    0.000000          0.63000
50%    250.500000  317.000000   107.000000           3.000000    3.500000    3.50000    8.560000    1.000000          0.72000
75%    375.250000  325.000000   112.000000           4.000000    4.000000    4.00000    9.040000    1.000000          0.82000
max    500.000000  340.000000   120.000000           5.000000    5.000000    5.00000    9.920000    1.000000          0.97000

Key Highlights from dataset:

  • Average GRE Score: 316.47
  • Average TOEFL Score: 107.19
  • Average CGPA: 8.58
  • With Research: 56% of applicants
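
The highlights above are column means taken from the describe() output. As a minimal sketch, the same figures could be computed directly. The frame below contains only the five rows printed by df.head(), so its numbers differ from the full 500-row averages, and the helper name key_highlights is an illustrative assumption rather than part of the original notebook.

```python
import pandas as pd

# The five rows shown by df.head() above; the full file has 500 rows.
sample = pd.DataFrame({
    "GRE Score": [337, 324, 316, 322, 314],
    "TOEFL Score": [118, 107, 104, 110, 103],
    "CGPA": [9.65, 8.87, 8.00, 8.67, 8.21],
    "Research": [1, 1, 1, 1, 0],
})

def key_highlights(frame):
    """Return column means, rounded to two decimals.

    Research is a 0/1 flag, so its mean is the share of applicants with research.
    """
    columns = ["GRE Score", "TOEFL Score", "CGPA", "Research"]
    return frame[columns].mean().round(2)

highlights = key_highlights(sample)
print(highlights)
```

On the full dataset, key_highlights(df) should reproduce the mean row of the describe() table.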

Checking if plotly and cufflinks are working correctly

In [3]:
#data = [go.Histogram(x=df["GRE Score"])]
# checking if plotly and cufflinks are working correctly
df['GRE Score'].iplot(kind="hist", bins=40,title="GRE Score Distribution")
In [4]:
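# This cell was truncated in the export: the arguments of the original
# `layout1 = cf.Layout(...)` call and any trailing iplot arguments were not preserved.
df.corr().iplot(kind='heatmap',colorscale='spectral', title = 'Correlation between different maps')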

The features most highly correlated with Chance of Admit are:

  • CGPA
  • GRE Score
  • TOEFL Score
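
The ranking above is read off the heatmap. A rough sketch of how it could be made explicit is to sort the features by their absolute Pearson correlation with the target. The helper rank_by_correlation and the toy frame are assumptions for illustration only; the toy values are not from the admissions data.

```python
import pandas as pd

def rank_by_correlation(frame, target):
    """Sort the other numeric columns by absolute Pearson correlation with `target`."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data (hypothetical): "a" tracks the target exactly, "b" loosely, "c" weakly and inversely.
toy = pd.DataFrame({
    "target": [1, 2, 3, 4, 5],
    "a": [2, 4, 6, 8, 10],
    "b": [1, 3, 2, 5, 4],
    "c": [5, 1, 4, 2, 3],
})

ranked = rank_by_correlation(toy, "target")
print(ranked)
```

On the notebook's data the call would be rank_by_correlation(df.drop(columns="Serial No."), "Chance of Admit "), keeping the trailing space in the column name that the later cells also use.
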
In [5]:
df['Admit Chance']=pd.cut(np.array(df['Chance of Admit ']),3, labels=["bad", "medium", "good"])
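
A note on what the cell above does: with an integer bin count, pd.cut splits the observed range of the values into equal-width intervals, so "bad", "medium" and "good" are thirds of the range from 0.34 to 0.97, not thirds of the applicants. The sketch below illustrates this on a few hypothetical values chosen to span that range.

```python
import pandas as pd

# Hypothetical admission chances spanning the dataset's observed range (0.34 to 0.97).
chances = pd.Series([0.34, 0.50, 0.60, 0.80, 0.97])

# bins=3 splits [min, max] into three equal-width intervals,
# here roughly (0.34, 0.55], (0.55, 0.76] and (0.76, 0.97].
labels, edges = pd.cut(chances, 3, labels=["bad", "medium", "good"], retbins=True)

print(labels.tolist())
print(edges.round(3))
```

If equally populated groups were wanted instead, pd.qcut would be the usual alternative; the notebook itself uses equal-width bins.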

How good are your chances of acceptance at a university, based on your scores?

NOTE: You can turn the markers on and off by clicking the legend entries in the charts below.

In [6]:
scores_attr=['CGPA', 'GRE Score', 'TOEFL Score']
for i in scores_attr:
    df.iplot(x=i,y='University Rating',categories='Admit Chance',colors=['green','blue','red'],
            xTitle=i,yTitle='University Rating',title=f'Chances of Acceptance based on your {i}')

How do your SOP and LOR affect your chances of getting accepted?

In [7]:
# color_dict is not defined in the cells shown; the mapping below is an assumed stand-in
color_dict = {'bad': 'red', 'medium': 'blue', 'good': 'green'}
df_grouped=df.groupby(['SOP','LOR ','Admit Chance']).size().reset_index(name='counts')
df_grouped.iplot(kind='bubble',x='SOP',y='LOR ',xTitle='SOP',yTitle='LOR',title='Distribution of SOP and LOR with acceptance chances',
                 size='counts',text='Admit Chance',colors=df_grouped['Admit Chance'].map(color_dict).tolist())

This one is quite natural: students with good SOPs and good LORs have better acceptance chances, although there are some exceptions.

Ideally, students who are good at academics should also have good GRE and TOEFL scores. Let's check this hypothesis below.

Zoom in, rotate, and check the values in the 3D plot below.

In [8]:
studious_students=df[df['CGPA'] > 8]
studious_students.iplot(kind='scatter3d', x='GRE Score', y='TOEFL Score',z='CGPA',mode='markers', xTitle='GRE Score',yTitle='TOEFL Score',zTitle='CGPA',
                        title='GRE vs TOEFL vs CGPA')

Our hypothesis seems to be true.
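
The 3D plot is a visual check. As a hedged numeric complement, one could compare mean GRE and TOEFL scores on either side of the CGPA threshold of 8 used above. The helper mean_scores_by_cgpa is an illustrative assumption, and the frame holds only the five rows shown by df.head(), so it demonstrates the mechanics and is not evidence for the hypothesis.

```python
import pandas as pd

# The five rows shown by df.head(); far too few to draw conclusions from.
sample = pd.DataFrame({
    "GRE Score": [337, 324, 316, 322, 314],
    "TOEFL Score": [118, 107, 104, 110, 103],
    "CGPA": [9.65, 8.87, 8.00, 8.67, 8.21],
})

def mean_scores_by_cgpa(frame, threshold=8.0):
    """Mean GRE and TOEFL scores for applicants above vs. at or below a CGPA threshold."""
    groups = (frame["CGPA"] > threshold).map({True: "above", False: "at_or_below"})
    return frame.groupby(groups)[["GRE Score", "TOEFL Score"]].mean()

comparison = mean_scores_by_cgpa(sample)
print(comparison)
```

Running mean_scores_by_cgpa(df) on the full data would show whether the gap holds across all 500 applicants.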

Now, let's check the relationship between SOP and LOR for students with and without research experience. A reasonable guess is that students with research experience have a good LOR and SOP.

In [9]:
df_research_grouped=df.groupby(['SOP','LOR ','Research']).size().reset_index(name='counts')
In [10]:
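import plotly.tools as tls

# df_research and df_non_research are not defined in the cells shown; splitting the
# grouped counts on the Research flag is an assumed reconstruction
df_research = df_research_grouped[df_research_grouped['Research'] == 1]
df_non_research = df_research_grouped[df_research_grouped['Research'] == 0]

fig = tls.make_subplots(rows=1, cols=2, shared_yaxes=True)
fig.append_trace({'x': df_non_research.SOP, 'y': df_non_research['LOR '],'text':df_non_research['counts'],'type': 'scatter', 'name': 'Non Research','mode':'markers'}, 1, 1)
fig.append_trace({'x': df_research.SOP, 'y': df_research['LOR '], 'type': 'scatter','text':df_research['counts'], 'name': 'Research','mode':'markers'}, 1, 2)
fig['layout'].update(hovermode= 'closest')
fig['layout'].update(title='SOP vs LOR for applicants with Research & Non Research experience')
iplot(fig)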

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y1 ]

If you hover over the points and compare the counts shown at the top-right corners of both plots, it appears that research does help make your SOP and LOR better.
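
Reading counts off hover labels is hard to summarize, so a small sketch of a numeric check follows: group by the Research flag and average the SOP and LOR ratings. The helper mean_sop_lor_by_research is an illustrative assumption, and the frame again contains only the five rows from df.head(), including a single applicant without research, so the output shows the method and nothing more.

```python
import pandas as pd

# The five rows shown by df.head(); "LOR " keeps the trailing space used in the dataset.
sample = pd.DataFrame({
    "SOP": [4.5, 4.0, 3.0, 3.5, 2.0],
    "LOR ": [4.5, 4.5, 3.5, 2.5, 3.0],
    "Research": [1, 1, 1, 1, 0],
})

def mean_sop_lor_by_research(frame):
    """Average SOP and LOR ratings for applicants without (0) and with (1) research."""
    return frame.groupby("Research")[["SOP", "LOR "]].mean()

means = mean_sop_lor_by_research(sample)
print(means)
```

On the full data, mean_sop_lor_by_research(df) gives one row per group. A difference in means would indicate an association, not that research causes better SOPs or LORs.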

To Summarize:

  • CGPA plays the most important role in admissions, followed by GRE score and TOEFL score.
  • Good SOPs and LORs are essential to get into the best universities.
  • Research makes your SOP and LOR better.
  • Studious students generally tend to do well on the GRE and TOEFL.

Disclaimer: This is a very small dataset, and the comments above are based only on the data given. Further analysis would require more data points and features.
