思路:1.获取拉勾网搜索到职位的页数
2.调用接口获取职位id
3.根据职位id访问页面,匹配出关键字
url访问采用unirest,由于拉钩反爬虫,短时间内频繁访问会被限制访问,所以没有采用多线程,而且每个页面访问时间间隔设定为10s,通过nokogiri解析页面,正则匹配只获取技能要求中的英文单词,可能存在数据不准确的情况
数据持久化到excel中,采用ruby erb生成word_cloud报告
爬虫代码:
require 'unirest'require 'uri'require 'nokogiri'require 'json'require 'win32ole'@position = '测试开发工程师'@city = '杭州'# 页面访问def query_url(method, url, headers:{}, parameters:nil) case method when :get Unirest.get(url, headers:headers).body when :post Unirest.post(url, headers:headers, parameters:parameters).body endend# 获取页数def get_page_num(url) html = query_url(:get, url).force_encoding('utf-8') html.scan(/(\d+)<\/span>/).first.firstend# 获取每页显示的所有职位的iddef get_positionsId(url, headers:{}, parameters:nil) response = query_url(:post, url, headers:headers, parameters:parameters) positions_id = Array.new response['content']['positionResult']['result'].each{|i| positions_id << i['positionId']} positions_idend# 匹配职位英文关键字def get_skills(url) puts "loading url: #{url}" html = query_url(:get, url).force_encoding('utf-8') doc = Nokogiri::HTML(html) data = doc.css('dd.job_bt') data.text.scan(/[a-zA-Z]+/)end# 计算词频def word_count(arr) arr.map!(&:downcase) arr.select!{ |i| i.length>1} counter = Hash.new(0) arr.each { |k| counter[k]+=1 } # 过滤num=1的数据 counter.select!{|_,v| v > 1} counter2 = counter.sort_by{|_,v| -v}.to_h counter2end# 转换def parse(hash) data = Array.new hash.each do |k,v| word = Hash.new word['name'] = k word['value'] = v data << word end JSON dataend# 持久化数据def save_excel(hash) excel = WIN32OLE.new('Excel.Application') excel.visible = false workbook = excel.Workbooks.Add() worksheet = workbook.Worksheets(1) # puts hash.size (1..hash.size+1).each do |i| if i == 1 # puts "A#{i}:B#{i}" worksheet.Range("A#{i}:B#{i}").value = ['关键词', '频次'] else # puts i # puts hash.keys[i-2], hash.values[i-2] worksheet.Range("A#{i}:B#{i}").value = [hash.keys[i-2], hash.values[i-2]] end end excel.DisplayAlerts = false workbook.saveas(File.dirname(__FILE__)+'\lagouspider.xls') workbook.saved = true excel.ActiveWorkbook.Close(1) excel.Quit()end# 获取页数url = URI.encode("https://www.lagou.com/jobs/list_#@position?city=#@city&cl=false&fromSearch=true&labelWords=&suginput=")num = get_page_num(url).to_iputs "存在 #{num} 个信息分页"skills = Array.new(1..num).each do |i| puts "定位在第#{i}页" # 获取positionsid url2 = URI.encode("https://www.lagou.com/jobs/positionAjax.json?city=#@city&needAddtionalResult=false") headers = {Referer:url, 'User-Agent':i%2==1?'Mozilla/5.0':'Chrome/67.0.3396.87'} parameters = {first:(i==1), pn:i, kd:@position} positions_id = get_positionsId(url2, headers:headers, parameters:parameters) positions_id.each do |id| # 访问具体职位页面,提取英文技能关键字 url3 = "https://www.lagou.com/jobs/#{id}.html" skills.concat get_skills(url3) sleep 10 endendcount = word_count(skills)save_excel(count)@data = parse(count)
效果展示: