The following information was scraped for over 5000 job openings listed on a government portal for virtual career fairs: title, company, posting date, job level, contract type, location, salary, job description, requirements, application closing date, and URL of the job posting. The URL is included so that a record can be cross-checked against the original posting if the scraped data looks strange.
This was done using Selenium and BeautifulSoup, and the code can be found here. Some data cleaning and exploration was also done afterwards.
The output was saved as a CSV ('engine_jobs_310720.csv') and uploaded to GitHub here. The job postings in the dataset were posted between 2020-07-02 and 2020-07-30.
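As a quick sanity check on the date range, something like the following could be run against the CSV. This is only a minimal sketch: the column name `date_posted` and the ISO date format are assumptions, not the actual schema of the file, and the sample rows are made up for illustration.

```python
import csv
import io
from datetime import date


def posting_date_range(csv_file, column="date_posted"):
    """Return the earliest and latest posting dates found in a CSV file object.

    `column` is a hypothetical column name; adjust it to the real header.
    Rows with an empty date are skipped.
    """
    dates = [
        date.fromisoformat(row[column])
        for row in csv.DictReader(csv_file)
        if row[column]
    ]
    return min(dates), max(dates)


# Illustrative in-memory data, not rows from the real dataset.
sample_csv = io.StringIO(
    "title,date_posted\n"
    "Analyst,2020-07-15\n"
    "Engineer,2020-07-30\n"
    "Clerk,2020-07-02\n"
)
first, last = posting_date_range(sample_csv)
print(first, last)
```

For the real file, the same function could be called inside `with open('engine_jobs_310720.csv', newline='') as f:`.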
Here are some statistics on the job openings. About 1 in 3 are junior-level jobs (i.e. Fresh/entry level, Junior Executive). Also, 36.5% of the openings are for permanent positions, which can be a sign of the financial stability of the hiring companies. However, companies are inconsistent in how they indicate contract type: some mention full time without indicating permanent or contract, so the true proportion of permanent positions might be higher than 36.5%. The median salary is about $2500-3600. In addition, the majority (more than 70%) of companies give a 30-day window for applications.
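One way to make the uncertainty around the permanent share explicit is to report a lower and an upper bound, treating postings that say neither "permanent" nor "contract" as unspecified. The sketch below is not the analysis used for the figures above; the bucketing rules and the sample values are assumptions for illustration.

```python
from collections import Counter


def classify_contract(raw):
    """Bucket a free-text contract type into permanent, contract, or unspecified."""
    text = raw.lower()
    if "permanent" in text:
        return "permanent"
    if "contract" in text:
        return "contract"
    # e.g. "Full Time" with no mention of permanent or contract
    return "unspecified"


def permanent_share_bounds(values):
    """Return (lower, upper) bounds on the share of permanent positions.

    Lower bound: only postings explicitly marked permanent.
    Upper bound: explicit permanent plus every unspecified posting.
    """
    counts = Counter(classify_contract(v) for v in values)
    total = sum(counts.values())
    lower = counts["permanent"] / total
    upper = (counts["permanent"] + counts["unspecified"]) / total
    return lower, upper


# Illustrative values, not taken from the real dataset.
sample = ["Permanent, Full Time", "Full Time", "Contract", "Permanent", "Contract, Full Time"]
lower, upper = permanent_share_bounds(sample)
print(f"permanent share is between {lower:.0%} and {upper:.0%}")
```

Reporting both bounds avoids having to guess what an unlabelled "Full Time" posting really means.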
More bivariate analysis can be done. Further cleaning and analysis involving text mining can also be done on the job descriptions and requirements. These two fields could also be merged and analyzed as a whole. Due to the way the site was designed or the way organisations fill out the page, tags were inconsistently assigned to the text sections of the page (i.e. job description and requirements). My script scraped the content tied to each tag into its own column, but for some postings both the description and the requirements were assigned to the same tag, which is why one of the two columns sometimes contains missing or illogical information.
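A minimal sketch of the merge step, assuming missing values are represented as `None` or empty strings (with pandas, NaN values would need to be converted first). The function name `merge_text_fields` is hypothetical and is not part of the original script.

```python
def merge_text_fields(description, requirements):
    """Combine description and requirements into one text field.

    Missing or blank values are skipped, and an identical value appearing
    in both columns is kept only once.
    """
    parts = []
    for value in (description, requirements):
        cleaned = (value or "").strip()
        if cleaned and cleaned not in parts:
            parts.append(cleaned)
    return "\n".join(parts)


# Illustrative rows: one column is empty because both sections shared a tag.
print(merge_text_fields("Build dashboards. Must know SQL.", None))
print(merge_text_fields("Build dashboards.", "Must know SQL."))
```

Because the merged field no longer depends on which tag a section was assigned to, text mining on it should be less sensitive to the inconsistent tagging.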