As a young web scraper, I have a dream: to one day become a Data Scientist myself. Yet in answer to the question “How to become a data scientist?” Google gives more than half a million contradictory answers, and they disagree most about the level of education of the average Data Scientist. Do I need a bachelor’s, master’s or PhD degree to find a job? Is higher always better? Do I keep studying or start working? The eternal conundrum!
Let’s start from scrape. What information about education level could I get from scraping job postings? I decided to concentrate on the academic qualifications for Data Scientist jobs in Canada using Glassdoor job postings as my source of information.
So, it’s scraping time! The procedure I used for web scraping was taken from here. I scraped a total of 567 Data Scientist job postings from Glassdoor, using ‘Canada’ as the location. For each posting I extracted the job title, company name, city, and job description. I then used Python 3.6 regular expressions to snoop into the job descriptions and scoop out the academic qualifications required by employers.
Finally, I drew a pie chart of the results. As the chart shows, 45% of all job postings require only a bachelor’s degree. This means a candidate with a bachelor’s degree has a good chance of getting a job interview, provided all other requirements are met. At the same time, 24% (=12%+12%) would prefer folks with an advanced degree, but don’t mind trying out a job seeker with only a bachelor’s degree as well. This gives a range from 45% to 69%, with a midpoint of 57%. Thus, roughly half of employers would consider folks with only a bachelor’s degree, provided all their other requirements are met. On the other hand, the chart also shows that in 32% (=7%+8%+17%) of cases employers will only consider candidates with advanced degrees. Still, the chances of bachelor’s degree holders are pretty good!
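The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch that uses the rounded percentages quoted from the chart; the variable names are made up for illustration, and because each slice is rounded the shares need not add up to exactly 100.

```python
# Rounded percentages as quoted from the pie chart in the text.
bachelor_only = 45              # postings asking only for a bachelor's degree
bachelor_acceptable = 12 + 12   # advanced degree preferred, bachelor's still considered
advanced_only = 7 + 8 + 17      # advanced degree strictly required

low = bachelor_only                          # lower bound of the range
high = bachelor_only + bachelor_acceptable   # upper bound of the range
midpoint = (low + high) / 2

print(f"bachelor's degree considered: {low}% to {high}%, midpoint {midpoint:.0f}%")
print(f"advanced degree only: {advanced_only}%")
print(f"sum of rounded shares: {high + advanced_only}%")
```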
In this section I will describe how I got the above results.
Scraping is easy. You only need Python 3.6 and Beautiful Soup 4 for parsing the HTML code. I blindly followed this Web Scraping Tutorial. After I had scraped 567 job postings, I found myself in a needle-in-a-haystack situation. The combined word count of all job descriptions was over 200,000 words. That’s a lot of information! Just for comparison: Tolstoy’s War and Peace has a word count of 587,287.
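A combined word count like this can be obtained in a couple of lines. The sketch below assumes the scraped descriptions are already held in a list of strings; the two sample descriptions and the helper name `word_count` are invented for illustration, and splitting on whitespace is only a rough proxy for counting words.

```python
# Hypothetical sample; the real run would hold all scraped job descriptions.
descriptions = [
    "We are looking for a Data Scientist with a bachelor's degree.",
    "MSc or PhD in Statistics required. Experience with Python is a plus.",
]


def word_count(text):
    """Count whitespace-separated tokens as a rough proxy for words."""
    return len(text.split())


# Sum the per-posting counts to get the size of the whole corpus.
total_words = sum(word_count(d) for d in descriptions)
print(total_words)
```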
First, I started reading what I got. To my surprise, I discovered at once that some postings had nothing to do with data science. For example, one posting had the following title: Clinical Scientist – Oncology – Prostate – (Home-Based). To examine this problem I created a frequency table for job titles.
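A frequency table of this kind can be built with `collections.Counter` from the standard library. The following is a minimal sketch, not the exact code used for this post: the list of titles is a made-up sample, and lower-casing and stripping the titles before counting is an assumption about how near-duplicates might be merged.

```python
from collections import Counter

# Hypothetical sample of scraped job titles (the real list had 567 entries).
titles = [
    "Data Scientist",
    "Senior Data Scientist",
    "data scientist ",
    "Machine Learning Engineer",
    "Data Scientist",
    "Clinical Scientist – Oncology – Prostate – (Home-Based)",
]

# Normalise case and surrounding whitespace so near-identical titles merge.
title_counts = Counter(t.strip().lower() for t in titles)

# Print the table from most to least frequent title.
for rank, (title, count) in enumerate(title_counts.most_common(), start=1):
    print(f"{rank}. {title}: {count}")
```

Titles that appear only once, like the clinical one here, sink to the bottom of the table, which makes unrelated postings easy to spot.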
Here is the top of the frequency table:
| No. | Job title | Postings |
|---|---|---|
| 4. | Business Intelligence Analyst | 10 |
| 5. | Senior Data Scientist | 9 |
| 6. | Senior Data Analyst | 8 |
| 7. | Machine Learning Engineer | 5 |
| 9. | Senior Manager, Data Scientist and Machine Learning… | 4 |
| 10. | Quantitative Analyst, Quantitative Solutions – Analytics | 4 |
| 12. | Big Data Developer | 4 |
Using the above table I browsed through a couple hundred postings. Although some were extraneous, the vast majority of them were definitely still related to data science. But now, I discovered there were more serious issues:
- Some employers don’t directly indicate the academic qualification in the job posting.
- Some postings are in French, which I don’t know (Diplôme universitaire!)
- HR people have different ways of referring to the same qualification (MS or M.S. or MSc. or Master Degree or even Masters Degree etc.). Some HR folks are even too busy to check for spelling errors (e.g. “degre” instead of “degree”).
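The spelling variants from the last point can be folded into a single regular expression. The pattern below is a sketch written for this list of variants, not the pattern actually used in the analysis; the names `MASTER` and `mentions_master` are hypothetical. The abbreviations are matched case-sensitively so that “Ms.” is not mistaken for a degree, while the spelled-out forms ignore case and tolerate the “degre” typo. Note that it still matches the “MS” in product names, an ambiguity discussed next.

```python
import re

MASTER = re.compile(
    r"(?<!\w)(?:"                              # not in the middle of a word (e.g. DBMS)
    r"M\.?Sc?\.?"                              # MS, M.S., MSc, MSc., M.Sc.
    r"|(?i:master(?:['’]?s)?\s+degre+s?)"      # Master Degree, Masters Degree, master's degre
    r")(?!\w)"                                 # and not followed by more word characters
)


def mentions_master(text):
    """Return True if the text contains any of the master's-degree variants."""
    return MASTER.search(text) is not None


for sample in ["MS", "M.S.", "MSc.", "Master Degree", "Masters Degree", "master's degre"]:
    print(f"{sample!r}: {mentions_master(sample)}")

print(mentions_master("Ms. Smith manages DBMS systems"))
```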
I used Python 3.6 regular expressions to process the raw text. I put together a set of keywords and expressions to spot the academic qualification in each job posting. Often the same keyword can have different meanings. For example, it is important to distinguish between ‘MS degree’ and ‘MS Access’, and not to count a posting twice when it mentions a master’s degree more than once. Positive and negative lookahead assertions were used to tell such expressions apart. I went through several iterations of the analysis, and after each one I modified the set of keywords and expressions to improve the quality of the text analysis. By the end I managed to achieve about 95% accuracy in determining the academic qualification stated in the job postings.
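To illustrate the lookahead idea, here is a small sketch with one positive and one negative lookahead. Both patterns and the list of product names are assumptions made for this example, not the expressions used in the actual analysis. The strict pattern accepts “MS” only when a degree word follows; the permissive one accepts any “MS” that is not followed by a product name.

```python
import re

# Positive lookahead: "MS" / "M.S." counts only if a degree word follows.
MS_STRICT = re.compile(r"\bM\.?S\.?(?=\s+degre+s?\b)")

# Negative lookahead: "MS" counts unless a (hypothetical) product name follows.
MS_PERMISSIVE = re.compile(r"\bMS\b(?!\s+(?:Access|Excel|Office|Word)\b)")


def mentions_ms(text, pattern):
    """Return a bool, so a posting counts once however often 'MS' appears."""
    return pattern.search(text) is not None


samples = [
    "MS degree in Statistics",
    "Proficiency in MS Access",
    "MS or PhD preferred, ideally an MS in Statistics",
]

for text in samples:
    strict = mentions_ms(text, MS_STRICT)
    permissive = mentions_ms(text, MS_PERMISSIVE)
    print(f"{text!r}: strict={strict}, permissive={permissive}")
```

The third sample contains “MS” twice, but because the helper returns a boolean rather than a count, the posting still contributes a single master’s requirement.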
The final statistics are provided in the table below:
| No. | Academic qualification | Postings |
|---|---|---|
| 4. | Bachelor or Master | 48 |
| 5. | Master or PhD | 72 |
As this table shows, almost a quarter of all postings belong to the “Undefined” category. Another surprising result! Postings in this category usually contain no keywords that clearly indicate the required qualification. Some employers are not concerned about a specific academic qualification, and the job description conveys the required qualification only between the lines. I guess such postings call for a dedicated AI study, similar to sentiment analysis. On the other hand, they are unlikely to skew the final statistics significantly, since they should be spread across all the other categories in the table. Postings in French and postings whose qualification was not detected correctly also end up in the “Undefined” category. Luckily, the relative number of such postings is negligibly small and can be ignored.
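One way to picture how a posting lands in a category: collect the degree levels detected in its description and turn that set into a label, falling back to “Undefined” when nothing was detected. The function below is a hypothetical sketch of that mapping; the label for all three levels together is an assumption, since only some of the category names appear in the table.

```python
# Fixed ordering so that labels read from lowest to highest degree.
LEVEL_ORDER = ["Bachelor", "Master", "PhD"]


def categorize(levels):
    """Map the set of degree levels found in one posting to a category label."""
    found = [level for level in LEVEL_ORDER if level in levels]
    if not found:
        # No keyword matched: French postings and missed detections land here too.
        return "Undefined"
    return " or ".join(found)


print(categorize({"Master", "Bachelor"}))
print(categorize({"PhD", "Master"}))
print(categorize(set()))
```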
As a result, the final statistics were calculated without the postings from the “Undefined” category.
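Dropping the “Undefined” postings before computing shares could look like the sketch below. The label counts are invented for illustration and do not reproduce the real data.

```python
from collections import Counter

# Hypothetical category labels, one per posting.
labels = (
    ["Bachelor"] * 5
    + ["Master or PhD"] * 2
    + ["Bachelor or Master"] * 1
    + ["Undefined"] * 2
)

counts = Counter(labels)

# Leave out the postings whose qualification could not be determined.
defined = {label: n for label, n in counts.items() if label != "Undefined"}
total = sum(defined.values())

# Percentage share of each remaining category.
shares = {label: round(100 * n / total, 1) for label, n in defined.items()}

for label, share in shares.items():
    print(f"{label}: {share}%")
```

Because the denominator excludes the undefined postings, the remaining shares add up to 100% on their own.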