I. Saving the scraped data to a JSON file
1. First, create a pipeline class dedicated to saving items to a JSON file.
import codecs, json
# codecs.open works like the built-in open(), but takes care of the encoding for us.

class JsonWithEncodingPipeline:
    def __init__(self):
        self.file = codecs.open('save_file.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):  # Scrapy calls close_spider when the spider finishes
        self.file.close()
The class above is our own implementation for saving items as JSON; Scrapy also ships with dedicated JSON export support.
from scrapy.exporters import JsonItemExporter
# The scrapy.exporters module also ships exporter classes for CSV, XML, pickle, marshal, etc.

class JsonExporterPipeline:
    def __init__(self):
        self.file = open('save_file.json', 'wb')  # exporters write bytes, so open in binary mode
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
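As the comment above notes, scrapy.exporters also provides CsvItemExporter, XmlItemExporter, PickleItemExporter and others. As a minimal sketch (the filename save_file.csv is just an example), swapping the JSON exporter for the CSV one only changes a couple of lines:

from scrapy.exporters import CsvItemExporter

class CsvExporterPipeline:
    # Same structure as JsonExporterPipeline above, but exports items as CSV rows.
    def __init__(self):
        self.file = open('save_file.csv', 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()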
2. Register whichever of the pipelines above you decide to use in ITEM_PIPELINES in settings.py and assign it a priority.
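For reference, the entry in settings.py might look like this (the module path tutorial_project.pipelines and the priority 300 are placeholders, adjust them to your own project):

# settings.py -- placeholder path and priority
ITEM_PIPELINES = {
    'tutorial_project.pipelines.JsonExporterPipeline': 300,
}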
II. Saving data to MySQL asynchronously
Scrapy crawls asynchronously with high concurrency, so the spider and downloader can fetch pages very quickly, while plain pymysql and SQLAlchemy are synchronous and blocking. If every item is written to MySQL in a blocking call, the inserts become the bottleneck and drag down the speed of the whole project, so when that matters it is worth writing to MySQL asynchronously. Twisted's adbapi does this by running the blocking DB-API calls in a thread pool, so the crawl itself is never blocked.
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors

class MysqlTwistedPipeline:
    def __init__(self, db_pool):
        self.db_pool = db_pool

    @classmethod
    def from_settings(cls, settings):
        # The keys in db_params must match the keyword arguments that
        # adbapi.ConnectionPool forwards to MySQLdb.connect().
        db_params = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        db_pool = adbapi.ConnectionPool('MySQLdb', **db_params)
        return cls(db_pool)

    def process_item(self, item, spider):
        # runInteraction runs the blocking insert in a thread pool and returns a Deferred.
        query = self.db_pool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # register an error callback
        return item

    def do_insert(self, cursor, item):
        # runInteraction passes in a cursor wrapped in a transaction.
        insert_sql = 'insert into table_name(name, age) values(%s, %s)'
        cursor.execute(insert_sql, (item['name'], item['age']))

    def handle_error(self, failure, item, spider):
        # Error callback: simply print the failure for now.
        print(failure)
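The pipeline reads its connection parameters from settings.py via from_settings, so the corresponding keys have to be defined there. A sketch with placeholder values (adjust the credentials and the tutorial_project.pipelines path to your own project):

# settings.py -- placeholder values
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'scrapy_db'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'secret'

ITEM_PIPELINES = {
    'tutorial_project.pipelines.MysqlTwistedPipeline': 300,
}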