I am looking for a detailed web crawl of any website.
I am aiming to crawl each page of a website and extract only certain information, which will finally be stored in a database (a suitable one to be suggested by you).
So, the input will be the domain; you need to find a way to compile all the URLs and then collect the information as laid out in the attached Excel sheet.
- Tab “Crawled URLs” will list all the URLs of the site
- Tab “Internal Links Raw Data” will list all the specifics of the internal links
For each crawl, you will need to record the results under a unique crawl ID. This is the first phase of the project. We will expand the scope once we can get the data correctly and reliably for large websites.
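To make the per-crawl recording concrete, here is a minimal sketch of how a unique crawl ID might key the data from the two tabs in a relational store. The table and column names are assumptions for illustration, not taken from the attached sheet, and SQLite stands in for whatever database is ultimately chosen.

```python
import sqlite3
import uuid

# In-memory database for illustration; a real run would use a file or server DB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crawls (
        crawl_id TEXT PRIMARY KEY,
        domain   TEXT NOT NULL,
        started  TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE crawled_urls (          -- mirrors the "Crawled URLs" tab
        crawl_id TEXT REFERENCES crawls(crawl_id),
        url      TEXT NOT NULL
    );
    CREATE TABLE internal_links (        -- mirrors the "Internal Links Raw Data" tab
        crawl_id    TEXT REFERENCES crawls(crawl_id),
        source_url  TEXT NOT NULL,
        target_url  TEXT NOT NULL,
        anchor_text TEXT
    );
""")

crawl_id = str(uuid.uuid4())             # one unique ID per crawl run
conn.execute("INSERT INTO crawls (crawl_id, domain) VALUES (?, ?)",
             (crawl_id, "example.com"))
conn.execute("INSERT INTO crawled_urls VALUES (?, ?)",
             (crawl_id, "https://example.com/"))
conn.commit()

rows = conn.execute("SELECT url FROM crawled_urls WHERE crawl_id = ?",
                    (crawl_id,)).fetchall()
print(rows)  # [('https://example.com/',)]
```

Keying every row by `crawl_id` means repeated crawls of the same domain never overwrite each other, which matters for the later phases mentioned above.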
The attached sheet explains the details of the required information.
To qualify for serious consideration of your proposal, you must provide the following in your bid:
- What Python library/package you will use and why
- What are the challenges you foresee and how you will overcome them? It is extremely important to get details here. This is the chance to show how good a fit you are for this project.
- What is your suggestion for data storage and why?
- What similar project did you do earlier and whether I can check that in action?
Please note that without the points above, your bid is unlikely to be considered seriously.
Hello sir,
I am a Python developer with more than 2 years of experience. I have done many projects in the past. I can work on:
1. Web scraping / data science / ML
2. Django
3. App development
4. C/C++
5. WordPress
Let's discuss your project in more detail in chat for better understanding.
Please take a look at my profile. Thanks.
Jaibhan Singh Gaur.
Hi,
The attachment shows some of your requirements. I would like to work on this project, but I'd like to ask some questions to make things clear. Here are my answers to your questions:
1- Python: Selenium, BeautifulSoup, Requests, Pandas, and other relevant libraries.
2- The main challenge is to bypass bot detection as much as possible, and to store the scraped data in a form ready for further analysis.
3- PostgreSQL, for its performance and integration capabilities.
4- I have scraped various datasets from many different sites.
Thanks
Hello,
I am interested in working on this project.
I plan to use libraries like requests, bs4, and selenium: requests for making HTTP requests to the page, bs4 for parsing and filtering the site's HTML, and selenium for dynamic scraping and rendering JavaScript.
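The link-collection step this plan describes can be sketched as follows. To keep the sketch runnable standalone, it uses the stdlib `html.parser` and `urllib.parse` in place of bs4 (the parsing logic is the same idea), and the HTML snippet and domain are made up:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(base_url, html):
    """Resolve hrefs against base_url and keep only same-domain links."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    resolved = (urljoin(base_url, href) for href in parser.links)
    return [u for u in resolved if urlparse(u).netloc == base_host]

html = '''<a href="/about">About</a>
          <a href="https://example.com/contact">Contact</a>
          <a href="https://other.org/">External</a>'''
print(internal_links("https://example.com/", html))
# ['https://example.com/about', 'https://example.com/contact']
```

Feeding each page's internal links back into a to-visit queue (with a visited set) is the usual way to compile all URLs of a domain from just the starting point.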
One of the challenges I see is getting the URLs of the domain. I am a bit confused about what you mean here; hopefully you can explain it better in chat.
I feel CSV is a good choice for data storage because it remains easy to access if the data gets large.
All my previous projects can be found on my profile.
Thank you for reading this proposal
Hello,
After reading your project requirements in detail and concluding that they match my areas of knowledge and skill, I would like to introduce myself.
My name is Anthony Muñoz and I am the lead engineer for DS Pro IT agency. I have worked for over 10 years in Backend and software development and have successfully done multiple jobs on this and other Freelance platforms. It will be a pleasure to work together to make your project a reality.
Please feel free to contact me. I'm looking forward to working with you. I really appreciate your time and remain attentive to any request or question.
Greetings
Hi. I’m an experienced data engineer; I use Python and MySQL/Oracle/Hive databases in my professional life.
I’m experienced in data mining, so crawling is not a big deal for me. I’m doing PhD research that includes website crawling, and I’m experienced in this area.
Earlier I worked at a hosting company, so I understand how web servers work and how to collect the desired data.
I offer high quality project execution with a professional approach. Customer’s satisfaction is a priority for me and result is guaranteed.
Answering your questions:
1) The simple requests library will handle this task;
2) Some requests can be detected as malicious activity, so we should be prepared to use other approaches, such as distributed crawling. Honestly, I haven’t crawled every kind of website, but I’m ready to face difficulties and solve them;
3) A MySQL database can handle billions of rows. Your table doesn’t look too wide, so I think everything will work fine. On the other hand, if the storage turns out to be too small, we can also use other data stores. First of all, it depends on your current hardware;
4) I can share a small web scraping project that scrapes data from a website and pushes it into a table. If you’re interested, I’ll send you a link.
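One concrete precaution against requests being flagged as malicious, before reaching for heavier approaches like distributed crawling, is honoring the site's robots.txt rules and crawl delay. A stdlib sketch (the robots.txt rules here are invented; a real crawler would fetch `https://<domain>/robots.txt` and check every URL before requesting it):

```python
from urllib.robotparser import RobotFileParser

# Parse an invented robots.txt body directly, offline, for illustration.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /admin/",
])

print(rp.can_fetch("*", "https://example.com/page"))     # True
print(rp.can_fetch("*", "https://example.com/admin/x"))  # False
print(rp.crawl_delay("*"))                               # 2
```

Sleeping for `crawl_delay` seconds between requests keeps the crawler polite and greatly reduces the chance of being blocked on large sites.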
Thank you and have a great day:)
Hi there,
I am a web scraping and automation expert with more than 3 years of experience.
I have seen your requirements, and based on them I would use BeautifulSoup and Selenium as libraries.
One problem the scraper could face is the website detecting the bot; to overcome this, we can bypass the CAPTCHA.
As for the crawled data: if you want to upload it to your website or similar, I would prefer to store it in SQL; otherwise, you can simply store it in a CSV file.
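If the CSV route is chosen, the "Internal Links Raw Data" rows can be written with the stdlib csv module. A minimal sketch, where the column names are assumptions about the sheet and `io.StringIO` stands in for an open file:

```python
import csv
import io

# Hypothetical columns for the "Internal Links Raw Data" tab.
fieldnames = ["crawl_id", "source_url", "target_url", "anchor_text"]
rows = [
    {"crawl_id": "c1", "source_url": "https://example.com/",
     "target_url": "https://example.com/about", "anchor_text": "About"},
]

buf = io.StringIO()                      # stand-in for open("links.csv", "w")
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()                     # column header row
writer.writerows(rows)                   # one CSV row per scraped link

print(buf.getvalue())
```

`csv.DictWriter` also handles quoting automatically, so anchor text containing commas or quotes will not corrupt the file.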
Thank you,
waiting for your response :)
Hey Dear,
We are a 45-person team and we deliver the following services:
1. React Native experts and development
2. Digital marketing (social media & management)
3. Designing (Photoshop and Illustrator)
4. Android development (Java, Kotlin)
5. Web designing (WordPress and Joomla)
6. Crypto wallets (blockchain, ERC)
7. Bots
8. Development (Laravel, Flutter, Ionic)
9. AutoCAD
10. Python experts
Thanks, and have a nice day.
Hi
My name is Mohamed Khaled and I'm a data analyst.
I can do this job for you as an I/O console application. I've made a similar project in Python: its goal is to scrape Amazon search results for whatever input you give, such as "Laptops". It scrapes all the result pages from 1 to 50, following the next-page link, and saves the data (price, name) into an Excel sheet.
I use Selenium, as it can scrape the website and navigate dynamically. The challenge is passing the CAPTCHA, and we can handle it using 2captcha, a console script in JavaScript, or by playing with the headers.