Asked by: 烧茄子 (new to the workforce)

% A PhD who's bad at math # a half-baked programmer /* mild procrastination + mild OCD */<!-- working toward becoming an all-round achiever -->// couldn't squeeze out the last line

Replies (4)

  1. 洛克
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # @Date    : 2015-11-10 16:21:59
    # Python 2 script: fetch one day's historical PSI page and print its readings table.
    import requests
    from bs4 import BeautifulSoup
    url = 'http://www.nea.gov.sg/anti-pollution-radiation-protection/air-pollution-control/psi/historical-psi-readings/year/2014/month/4/day/1'

    req = requests.get(url)

    soup = BeautifulSoup(req.text, "html.parser")

    # Each <tr> in the table body is one row of readings.
    PSI_infos = soup.find('table', class_="text_psinormal").find(
        'tbody').find_all('tr')

    for info in PSI_infos:
        # The values sit in <span> elements that carry an id attribute.
        items = info.find_all('span', id=True)
        for item in items:
            print item.get_text(),   # Python 2 print: keep each row on one line
        print
    
  2. 叶泽心

    Thanks for the invite.

    Code first:

    # -*- coding: utf-8 -*-
    import requests
    import re
    import datetime
    import os
    import csv
    class Spider():
        def __init__(self):
            # URL template: yearNum / monthNum / dayNum get substituted per request
            self.url=u'http://www.nea.gov.sg/anti-pollution-radiation-protection/air-pollution-control/psi/historical-psi-readings/year/yearNum/month/monthNum/day/dayNum'
            self.headers={
                        'Host': 'www.nea.gov.sg',
                        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0',
                        'Connection': 'keep-alive'
                        }        
            # Regex for one table row: an opening <tr>, seven <td> cells, then </tr>
            zuhe1=u'<tr>.*?'
            zuhe2=u'<td>(.*?)</td>.*?'
            zuhe3=u'</tr>'
            self.zuhe=zuhe1+zuhe2*7+zuhe3
            
        def tool(self,x):
            # Strip HTML tags, newlines and carriage returns from a captured cell
            x=re.sub(re.compile('<.*?>'),'',x)
            x=re.sub(re.compile('\n'),'',x)
            x=re.sub(re.compile('\r'),'',x)
            x=x.encode('utf-8')
            return x.strip()
                
        def handleDate(self,year,month,day):
            # Build a date object for the given year/month/day
            date=datetime.date(year,month,day)
    #        print date.datetime.datetime.strftime('%Y-%m-%d')
            return date  # date object
            
        def timeDelta(self,year,month):
            # Number of days in the given month
            date=datetime.date(year,month,1)
            try:
                date2=datetime.date(date.year,date.month+1,date.day)
            except ValueError:
                # December rolls over to January of the next year
                date2=datetime.date(date.year+1,1,date.day)
            dateDelta=(date2-date).days
            return dateDelta
            
        def getPageContent(self,date):
            # Fill the date into the URL template and download the page
            url=self.url
            url=url.replace(u'yearNum',str(date.year))
            url=url.replace(u'monthNum',str(date.month))
            url=url.replace(u'dayNum',str(date.day))
    #        print url
            r=requests.get(url,headers=self.headers)
            r.encoding='utf-8'
            pageContent=r.text
    #        f=open('content.html','w')
    #        f.write(pageContent.encode('utf-8'))
    #        f.close()
            return pageContent
            
        def getPageInfos(self,pageContent):
            # Grab the <tbody> section, then split it into rows with the regex from __init__
            pattern1=re.compile(u'<tbody>(.*?)</tbody>',re.S)
            result1=re.search(pattern1,pageContent)
            content1=result1.group(1)
            pattern2=re.compile(self.zuhe,re.S)
            infos=re.findall(pattern2,content1)
            return infos
        
        def saveInfo(self,info,date):
            # Append one row to psi/YYYY/MM/DD.csv, creating the file on first write
            fileName= 'psi/'+datetime.datetime.strftime(date,'%Y')+'/'+datetime.datetime.strftime(date,'%m')+'/'+datetime.datetime.strftime(date,'%d')+'.csv'
            if os.path.exists(fileName):
                mode='ab'
            else:
                mode='wb'
            csvfile=open(fileName,mode)
            writer=csv.writer(csvfile)
    #        if mode=='wb':
    #            writer.writerow(self.rowName)
            writer.writerow([self.tool(i) for i in info])
            csvfile.close()
    
        def mkdir(self,date):
            # Create the psi/YYYY/MM directory if it does not exist yet
            path = 'psi/'+datetime.datetime.strftime(date,'%Y')+'/'+datetime.datetime.strftime(date,'%m')
            isExists=os.path.exists(path)
            if not isExists:
                os.makedirs(path)
                
        def saveAllInfo(self,infos,date):
            for (i,info) in enumerate(infos):
                self.mkdir(date)
                self.saveInfo(info,date)
    #            print 'save info from link',i+1,'/',len(infos)   
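
    The class above only defines the spider; for illustration, a hypothetical driver (not part of the original reply) that crawls one month with it could look like this, with the year and month picked arbitrarily:

    # Hypothetical driver, not part of the original reply: crawl one month of
    # readings with the Spider class above (Python 2, same as the class).
    if __name__ == '__main__':
        spider = Spider()
        year, month = 2014, 4                      # arbitrary example month
        for day in range(1, spider.timeDelta(year, month) + 1):
            date = spider.handleDate(year, month, day)
            pageContent = spider.getPageContent(date)
            infos = spider.getPageInfos(pageContent)
            spider.saveAllInfo(infos, date)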
    
  3. 深海鱼

    The bottleneck is mainly disk I/O: 30 threads crawling pages vs. a single thread writing to the .csv file.

    Switching to concurrent writes into a database should be a lot faster (see the sketch after the traceback below).

    Sometimes a RecursionError shows up right at the start, even though my code doesn't use any recursion.

    Traceback (most recent call last):
      File "C:\Python 3.5\lib\multiprocessing\queues.py", line 241, in _feed
        obj = ForkingPickler.dumps(obj)
      File "C:\Python 3.5\lib\multiprocessing\reduction.py", line 50, in dumps
        cls(buf, protocol).dump(obj)
    RecursionError: maximum recursion depth exceeded
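
    On the RecursionError: the traceback points at the multiprocessing queue's feeder thread pickling an item (ForkingPickler.dumps), not at the crawling code itself, so one likely culprit is putting a deeply nested object, such as a parsed HTML tree, on the queue; pickle recurses through it and can hit the limit. Passing only plain strings or tuples avoids that, and it also fits the database suggestion above. A rough sketch of that pattern, with an illustrative SQLite table that is not taken from the original code:

    # Sketch only: several crawler processes, one SQLite writer fed by a queue.
    # The table layout and the crawl() stub are illustrative, not the original code.
    import sqlite3
    from multiprocessing import Process, Queue

    def writer(q):
        conn = sqlite3.connect('psi.db')
        conn.execute('CREATE TABLE IF NOT EXISTS psi (day TEXT, hour TEXT, value TEXT)')
        while True:
            row = q.get()
            if row is None:                     # sentinel: all crawlers are done
                break
            conn.execute('INSERT INTO psi VALUES (?, ?, ?)', row)
            conn.commit()
        conn.close()

    def crawl(day, q):
        # ... fetch and parse the page for `day` here ...
        # Put only plain tuples on the queue, never parsed-tree objects.
        q.put((day, '1am', 'placeholder value'))

    if __name__ == '__main__':
        q = Queue()
        w = Process(target=writer, args=(q,))
        w.start()
        crawlers = [Process(target=crawl, args=('2014-04-%02d' % d, q))
                    for d in range(1, 31)]
        for p in crawlers:
            p.start()
        for p in crawlers:
            p.join()
        q.put(None)                             # tell the writer to stop
        w.join()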
    
  4. 老王

    @张天

    Had a look with F12 in Firefox: a few JS scripts fail to load, and their domains point to Google. The crawler probably needs to find a way to supply those scripts itself (jquery.min.js at least is very common; the others can be tracked down with a search).
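
    The thread doesn't say which browser driver the crawler uses, but if it is Selenium-driven, one way to apply this is to load the page with a timeout (so the blocked Google-hosted scripts can't hang it forever) and then inject a locally downloaded jquery.min.js by hand. A rough sketch under those assumptions, reusing the PSI URL from the first reply:

    # Rough sketch, assuming a Selenium + Firefox setup and a jquery.min.js file
    # saved next to the script; neither assumption comes from the original thread.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    url = 'http://www.nea.gov.sg/anti-pollution-radiation-protection/air-pollution-control/psi/historical-psi-readings/year/2014/month/4/day/1'

    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)
    try:
        driver.get(url)
    except TimeoutException:
        pass  # the blocked Google-hosted scripts may keep the page "loading" forever

    # Inject a local copy of jQuery so page scripts that expect it can still run
    with open('jquery.min.js') as f:
        driver.execute_script(f.read())

    html = driver.page_source
    driver.quit()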
